Problem description
I have a file2 in which ~1400 of the $5 values have "unknown" before the -. What I am trying to do is use the text in $2 of file1 to update those "unknown" values in file2. $1 of file1 holds a range of numbers, and an "unknown" can be updated if $4 of file2 falls within that range. I am really not sure where to start, but maybe the awk below is a start, or there is probably a better way. Thank you :).
file1
`$1` `$2`
chr6:3224495-3227968 TUBB2B
chr16:89988417-90002505 TUBB3
file2
chr16 89985657 89986630 chr16:89985657-89986630 MC1R-2270|gc=63.5
chr16 89989779 89989898 chr16:89989779-89989898 unknown-2271|gc=73.9
chr16 89998969 89999097 chr16:89998969-89999097 unknown-2272|gc=57
chr16 89999866 89999996 chr16:89999866-89999996 unknown-2273|gc=55.4
chr16 90001127 90002222 chr16:90001127-90002222 unknown-2274|gc=63.9
chr17 1173848 1174575 chr17:1173848-1174575 BHLHA9-3|gc=78.7
Desired output ("unknown" updated to TUBB3 because the $4 value of file2 is within the range in $1 of file1).
chr16 89985657 89986630 chr16:89985657-89986630 MC1R-2270|gc=63.5
chr16 89989779 89989898 chr16:89989779-89989898 TUBB3-2271|gc=73.9
chr16 89998969 89999097 chr16:89998969-89999097 TUBB3-2272|gc=57
chr16 89999866 89999996 chr16:89999866-89999996 TUBB3-2273|gc=55.4
chr16 90001127 90002222 chr16:90001127-90002222 TUBB3-2274|gc=63.9
chr17 1173848 1174575 chr17:1173848-1174575 BHLHA9-3|gc=78.7
awk
awk '
NR == FNR {min[$1]=$4; next}
{
for (id in min)
if ([id] = $5 && [id]) {
print $0, id
break
}
}
' file1 file2
awk -v OFS='\t' 'NR==FNR{split($1,a,/[:-]/)
rstart[a[1]]=a[2]
rend[a[1]]=a[3]
value[a[1]]=$2
next}
$5~/unknown/ && $2>=rstart[$1] && $3<=rend[$1]
{sub(/unknown/,value[$1],$5)}1' file1 file2 |
column -t > output
chr16 89985657 89986630 chr16:89985657-89986630 MC1R-2270|gc=63.5
chr16 89989779 89989898 chr16:89989779-89989898 unknown-2271|gc=73.9
chr16 89989779 89989898 chr16:89989779-89989898 TUBB3-2271|gc=73.9
chr16 89998969 89999097 chr16:89998969-89999097 unknown-2272|gc=57
chr16 89998969 89999097 chr16:89998969-89999097 TUBB3-2272|gc=57
chr16 89999866 89999996 chr16:89999866-89999996 unknown-2273|gc=55.4
chr16 89999866 89999996 chr16:89999866-89999996 TUBB3-2273|gc=55.4
chr16 90001127 90002222 chr16:90001127-90002222 unknown-2274|gc=63.9
chr16 90001127 90002222 chr16:90001127-90002222 TUBB3-2274|gc=63.9
chr17 1173848 1174575 chr17:1173848-1174575 BHLHA9-3|gc=78.7
Recommended answer
awk to the rescue!
$ awk -v OFS='\t' 'NR==FNR{split($1,a,/[:-]/)   # file1: split chr:start-end
rstart[a[1]]=a[2]                               # range start, keyed by chromosome
rend[a[1]]=a[3]                                 # range end
value[a[1]]=$2                                  # name to substitute
next}
$5~/unknown/ && $2>=rstart[$1] && $3<=rend[$1] {sub(/unknown/,value[$1],$5)}1' file1 file2 |
column -t
chr16 89985657 89986630 chr16:89985657-89986630 MC1R-2270|gc=63.5
chr16 89989779 89989898 chr16:89989779-89989898 TUBB3-2271|gc=73.9
chr16 89998969 89999097 chr16:89998969-89999097 TUBB3-2272|gc=57
chr16 89999866 89999996 chr16:89999866-89999996 TUBB3-2273|gc=55.4
chr16 90001127 90002222 chr16:90001127-90002222 TUBB3-2274|gc=63.9
chr17 1173848 1174575 chr17:1173848-1174575 BHLHA9-3|gc=78.7
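One detail that matters here: awk attaches an action to a pattern only when the opening brace starts on the same line as the pattern. A pattern left alone on its line gets the default action (print), and a block on the following line then runs for every record. That would be consistent with the doubled lines shown in the question. A minimal sketch, using a made-up two-line input that is not from the question:

```shell
# Hypothetical input, for illustration only.
input='a unknown
b known'

# Pattern and action on the same line: the sub() runs only for matching lines.
same_line=$(printf '%s\n' "$input" |
  awk '$2=="unknown" {sub(/unknown/,"fixed",$2)}1')

# Pattern alone on its line: it prints the matching line as-is (default action),
# then the block on the next line runs for every line before the final print.
split_line=$(printf '%s\n' "$input" |
  awk '$2=="unknown"
{sub(/unknown/,"fixed",$2)}1')

printf '%s\n' "$same_line"
echo ---
printf '%s\n' "$split_line"
```

The second variant prints the matching line twice: once unchanged and once after the substitution.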
Assigning to $5 makes awk rebuild the line with OFS, which modifies the original spacing, so the output is piped to column -t for tabular format.
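The arrays above are keyed by chromosome only, so if file1 ever lists more than one range on the same chromosome, the later range overwrites the earlier one. The sample file1 has one range per chromosome, so this may not matter for the real data. If it does, the following is a rough sketch of one way to keep every range; the file names, gene names, and coordinates are invented for illustration and are not from the question:

```shell
# Hypothetical sample data: two ranges on the same chromosome.
dir=$(mktemp -d)
cat > "$dir/ranges.txt" <<'EOF'
chr1:100-200 geneA
chr1:300-400 geneB
EOF
cat > "$dir/targets.txt" <<'EOF'
chr1 110 120 chr1:110-120 unknown-1|gc=50
chr1 310 320 chr1:310-320 unknown-2|gc=60
chr1 500 510 chr1:500-510 unknown-3|gc=70
chr2 110 120 chr2:110-120 unknown-4|gc=40
EOF

result=$(awk -v OFS='\t' '
NR==FNR {                          # first file: collect every range
    split($1, a, /[:-]/)           # a[1]=chromosome, a[2]=start, a[3]=end
    n = ++count[a[1]]              # ranges seen so far on this chromosome
    rstart[a[1], n] = a[2] + 0     # +0 forces a numeric value
    rend[a[1], n]   = a[3] + 0
    value[a[1], n]  = $2
    next
}
$5 ~ /^unknown/ {                  # second file: only rows still unknown
    for (i = 1; i <= count[$1]; i++)
        if ($2 >= rstart[$1, i] && $3 <= rend[$1, i]) {
            sub(/^unknown/, value[$1, i], $5)
            break                  # first containing range wins
        }
}
1' "$dir/ranges.txt" "$dir/targets.txt")

printf '%s\n' "$result"
rm -r "$dir"
```

Rows that fall outside every range, or on a chromosome with no ranges, are printed unchanged. As with the original command, rewritten rows come out tab-separated, so piping through column -t is still useful.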