Problem description
I often need to find a particular sequence in a FASTA file and print it. For those who don't know, FASTA is a text file format for biological sequences (DNA, proteins, etc.). It's pretty simple: there is a line with the sequence name preceded by a '>', and all the lines that follow, up to the next '>', are the sequence itself. For example:
>sequence1
ACTGACTGACTGACTG
>sequence2
ACTGACTGACTGACTG
ACTGACTGACTGACTG
>sequence3
ACTGACTGACTGACTG
The way I'm currently getting the sequence I need is to use grep with -A, so I'll do
grep -A 10 sequence_name filename.fa
and then if I don't see the start of the next sequence in the file, I'll change the 10 to 20 and repeat until I'm sure I'm getting the whole sequence.
It seems like there should be a better way to do this. For example, can I ask it to print up until the next '>' character?
Recommended answer
Use > as the record separator:
awk -v seq="sequence2" -v RS='>' '$1 == seq {print RS $0}' file
>sequence2
ACTGACTGACTGACTG
ACTGACTGACTGACTG
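If you would rather not change the record separator, a line-oriented variant is sketched below. It is only an illustration, not part of the answer above: the function name `fasta_get` and the temporary demo file are made up for the example. It assumes the sequence name is the first whitespace-delimited word after the '>', and it treats a '>' as a header marker only when it starts a line.

```shell
# Hypothetical helper (name is an assumption): print one FASTA record by exact name.
# Usage: fasta_get NAME FILE
fasta_get() {
    awk -v seq="$1" '
        # On every header line, decide whether this is the wanted record.
        /^>/ { keep = ($1 == (">" seq)) }
        # Print the header and sequence lines while inside the wanted record.
        keep
    ' "$2"
}

# Demo file with the same content as the example in the question.
demo=$(mktemp)
printf '%s\n' '>sequence1' 'ACTGACTGACTGACTG' \
              '>sequence2' 'ACTGACTGACTGACTG' 'ACTGACTGACTGACTG' \
              '>sequence3' 'ACTGACTGACTGACTG' > "$demo"

fasta_get sequence2 "$demo"
```

Because the name is compared for equality rather than matched as a pattern, asking for `sequence` does not also pull in `sequence1`, `sequence2` and so on, which is what a plain grep on the name would do.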