问题描述
我有2个文件.一个是fasta文件,其中包含多个fasta序列,而另一个文件中包含我要搜索的候选序列的名称(下面的文件示例).
I have a 2 files. One is a fasta file contain multiple fasta sequences, while another file includes the names of candidate sequences I want to search (file Example below).
seq.fasta
seq.fasta
>Clone_18
GTTACGGGGGACACATTTTCCCTTCCAATGCTGCTTTCAGTGATAAATTGAGCATGATGGATGCTGATAATATCATTCCCGTGT
>Clone_23
GTTACGGGGGGCCGAAAAACACCCAATCTCTCTCTCGCTGAAACCCTACCTGTAATTTGCCTCCGATAGCCTTCCCCGGTGA
>Clone_27-1
GTTACGGGGACCACACCCTCACACATACAAACACAAACACTTCAAGTGACTTAGTGTGTTTCAGCAAAACATGGCTTC
>Clone_27-2
GTTACGGGGACCACACCCTCACACATACAAACACAAACACTTCAAGTGACTTAGTGTGTTTCAGCAAAACATGGCTTCGTTTTGTTCTAGATTAACTATCAGTTTGGTTCTGTTTGTCCTCGTACTGGGTTGTGTCAATGCACAACTT
>Clone_34-1
GTTACGGGGGAATAACAAAACTCACCAACTAACAACTAACTACTACTTCACTTTTCAACTACTTTACTACAATACTAAGAATGAAAACCATTCTCCTCATTATCTTTGCTCTCGCTCTTTTCACAAGAGCTCAAGTCCCTGGCTACCAAGCCATCG
>Clone_34-3
GTTACGGGGGAATAACAAAACTCACCAACTAACAACTAACTACTACTTCACTTTTCAACTACTTTACTACAATACTAAGAATGAAAACCATTCTCCTCATTATCTTTGCTCTCGCTCTTTTCACAAGAGCTCAAGTCCCTGGCTACCAAGCCATCGATATCGCTGAAGCCCAATC
>Clone_44-1
GTTACGGGGGAATCCGAATTCACAGATTCAATTACACCCTAAAATCTATCTTCTCTACTTTCCCTCTCTCCATTCTCTCTCACACACTGTCACACACATCC
>Clone_44-3
GTTACGGGGGAATCCGAATTCACAGATTCAATTACACCCTAAAATCTATCTTCTCTACTTTCCCTCTCTCCATTCTCTCTCACACACTGTCACACACATCCCGGCAGCGCAGCCGTCGTCTCTACCCTTCACCAGGAATAAGTTTATTTTTCTACTTAC
name.txt
Clone_23
Clone_27-1
我想使用AWK搜索fasta文件,并获取其名称保存在另一个文件中的给定候选者的所有fasta序列.
I want to use AWK to search through the fasta file, and obtain all the fasta sequences for given candidates whose names were saved in another file.
awk 'NR==FNR{a[$1]=$1} BEGIN{RS="\n>"; FS="\n"} NR>FNR {if (match($1,">")) {sub(">","",$1)} for (p in a) {if ($1==p) print ">"$0}}' name.txt seq.fasta
问题是我只能这样提取name.txt中第一个候选者的序列
The problem is that I can only extract the sequence of first candidate in name.txt, like this
>Clone_23
GTTACGGGGGGCCGAAAAACACCCAATCTCTCTCTCGCTGAAACCCTACCTGTAATTTGCCTCCGATAGCCTTCCCCGGTGA
任何人都可以帮助修复上面的单行awk命令吗?
Can anyone help to fix one-line awk command above?
推荐答案
如果可以,甚至还可以打印名称,则可以简单地使用grep
:
If it is ok or even desired to print the name as well, you can simply use grep
:
grep -Ff name.txt -A1 a.fasta
-
-f name.txt
从name.txt
中选择模式 -
-F
将它们视为文字字符串而不是正则表达式 -
A1
打印匹配的行以及下一行 -f name.txt
picks patterns fromname.txt
-F
treats them as literal strings rather than regular expressionsA1
prints the matching line plus the subsequent line
如果在输出中不需要名称,我将简单地通过管道传送到另一个grep
:
If the names are not desired in output I would simply pipe to another grep
:
above_command | grep -v '>'
一个awk
解决方案可以如下所示:
An awk
solution can look like this:
awk 'NR==FNR{n[$0];next} substr($0,2) in n && getline' name.txt a.fasta
在多行版本中进行更好的解释:
Better explained in a multiline version:
# True as long as we are reading the first file, name.txt
NR==FNR {
# Store the names in the array 'n'
n[$0]
next
}
# I use substr() to remove the leading `>` and check if the remaining
# string which is the name is a key of `n`. getline retrieves the next line
# If it succeeds the condition becomes true and awk will print that line
substr($0,2) in n && getline
这篇关于给定第二个包含序列名称的文件,使用AWK搜索fasta文件的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!