awk: update a file if a value is within a range

Problem description

I have a file2 with about 1400 values in $5 whose text before the - is "unknown". What I am trying to do is use the text in $2 of file1 to update those "unknown" values in file2. $1 of file1 holds a range of coordinates, and an "unknown" should be updated if the $4 of file2 falls within that range. I am really not sure where to start; maybe the awk below is a start, or there is probably a better way. Thank you :).

file1

       `$1`           `$2`
 chr6:3224495-3227968 TUBB2B
 chr16:89988417-90002505 TUBB3

file2

chr16   89985657    89986630    chr16:89985657-89986630 MC1R-2270|gc=63.5
chr16   89989779    89989898    chr16:89989779-89989898 unknown-2271|gc=73.9
chr16   89998969    89999097    chr16:89998969-89999097 unknown-2272|gc=57
chr16   89999866    89999996    chr16:89999866-89999996 unknown-2273|gc=55.4
chr16   90001127    90002222    chr16:90001127-90002222 unknown-2274|gc=63.9
chr17   1173848 1174575 chr17:1173848-1174575   BHLHA9-3|gc=78.7

Desired output (unknown is updated to TUBB3 because the $4 value of file2 is within the range in $1 of file1):

chr16   89985657    89986630    chr16:89985657-89986630 MC1R-2270|gc=63.5
chr16   89989779    89989898    chr16:89989779-89989898 TUBB3-2271|gc=73.9
chr16   89998969    89999097    chr16:89998969-89999097 TUBB3-2272|gc=57
chr16   89999866    89999996    chr16:89999866-89999996 TUBB3-2273|gc=55.4
chr16   90001127    90002222    chr16:90001127-90002222 TUBB3-2274|gc=63.9
chr17   1173848 1174575 chr17:1173848-1174575   BHLHA9-3|gc=78.7
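
The rule can be checked by hand for a single row. This is a minimal sketch, assuming the interval in $4 of file2 must lie entirely inside the interval in $1 of file1; the shell variable names are illustrative and not part of the original files.

```shell
# Check one file2 interval against one file1 interval.
region='chr16:89988417-90002505'    # first column of file1 (TUBB3)
target='chr16:89989779-89989898'    # fourth column of file2 (unknown-2271)

verdict=$(awk -v r="$region" -v t="$target" 'BEGIN {
    split(r, a, /[:-]/)             # a[1]=chr, a[2]=start, a[3]=end
    split(t, b, /[:-]/)
    # same chromosome, and the target lies entirely inside the region
    if (a[1] == b[1] && b[2]+0 >= a[2]+0 && b[3]+0 <= a[3]+0)
        print "inside"
    else
        print "outside"
}')
echo "$verdict"
```

For this pair the check prints inside, which is why unknown-2271 should become TUBB3-2271.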

awk (first attempt, which fails with a syntax error)

awk '
NR == FNR {min[$1]=$4; next}
{
    for (id in min)
        if ([id] = $5 && [id]) {
            print $0, id
            break
        }
}
' file1 file2

Second attempt, which prints every updated line twice (once with unknown, once with the gene name):

awk -v OFS='\t' 'NR==FNR{split($1,a,/[:-]/)
                       rstart[a[1]]=a[2]
                       rend[a[1]]=a[3]
                       value[a[1]]=$2
                       next}
 $5~/unknown/ && $2>=rstart[$1] && $3<=rend[$1]
                      {sub(/unknown/,value[$1],$5)}1' file1 file2 |
column -t > output

chr16   89985657    89986630    chr16:89985657-89986630 MC1R-2270|gc=63.5
chr16   89989779    89989898    chr16:89989779-89989898 unknown-2271|gc=73.9
chr16   89989779    89989898    chr16:89989779-89989898 TUBB3-2271|gc=73.9
chr16   89998969    89999097    chr16:89998969-89999097 unknown-2272|gc=57
chr16   89998969    89999097    chr16:89998969-89999097 TUBB3-2272|gc=57
chr16   89999866    89999996    chr16:89999866-89999996 unknown-2273|gc=55.4
chr16   89999866    89999996    chr16:89999866-89999996 TUBB3-2273|gc=55.4
chr16   90001127    90002222    chr16:90001127-90002222 unknown-2274|gc=63.9
chr16   90001127    90002222    chr16:90001127-90002222 TUBB3-2274|gc=63.9
chr17   1173848 1174575 chr17:1173848-1174575   BHLHA9-3|gc=78.7
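
The doubled lines are what awk produces when a pattern and its action sit on separate lines: the newline ends the rule, so the pattern gets the default action (print), and the braces that follow become a separate rule that runs on every line. A small sketch with made-up two-line input shows the difference:

```shell
# Pattern and action on separate lines: awk sees two independent rules.
split_rules=$(printf 'a unknown\nb known\n' | awk '
$2 == "unknown"
{ sub(/unknown/, "TUBB3", $2) }
1')

# Pattern and action on one line: a single rule, so each line prints once.
one_rule=$(printf 'a unknown\nb known\n' | awk '
$2 == "unknown" { sub(/unknown/, "TUBB3", $2) }
1')

printf 'separate lines:\n%s\n\nsame line:\n%s\n' "$split_rules" "$one_rule"
```

The first variant prints the matching line twice (before and after the substitution); the second prints it once, already updated.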

Recommended answer

awk to the rescue!

$ awk -v OFS='\t' 'NR==FNR{split($1,a,/[:-]/)   # file1: chr, start, end
                           rstart[a[1]]=a[2]
                           rend[a[1]]=a[3]
                           value[a[1]]=$2
                           next}
     # the action must open on the same line as its pattern
     $5~/unknown/ && $2>=rstart[$1] && $3<=rend[$1] {sub(/unknown/,value[$1],$5)}
     1' file1 file2 |
  column -t

chr16  89985657  89986630  chr16:89985657-89986630  MC1R-2270|gc=63.5
chr16  89989779  89989898  chr16:89989779-89989898  TUBB3-2271|gc=73.9
chr16  89998969  89999097  chr16:89998969-89999097  TUBB3-2272|gc=57
chr16  89999866  89999996  chr16:89999866-89999996  TUBB3-2273|gc=55.4
chr16  90001127  90002222  chr16:90001127-90002222  TUBB3-2274|gc=63.9
chr17  1173848   1174575   chr17:1173848-1174575    BHLHA9-3|gc=78.7

Assigning to $5 makes awk rebuild the line with OFS, which changes the original spacing on the updated lines, so the output is piped to column -t to get a uniform tabular format.
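
A short sketch of that spacing effect, using made-up two-line input and a tab as OFS: only the record whose field is assigned gets rejoined with OFS, while the untouched record keeps its original spaces.

```shell
# The first record is modified, so awk rejoins its fields with a tab.
# The second record does not match the pattern and is printed as read.
rebuilt=$(printf 'x    unknown-1\ny    GENE-2\n' |
    awk -v OFS='\t' '$2 ~ /unknown/ { sub(/unknown/, "TUBB3", $2) }
                     1')
printf '%s\n' "$rebuilt"
```

Piping such mixed output through column -t, as the answer does, evens out the columns again.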


08-22 17:36