Question
Hello, I have a FASTA file such as:
>sequence1_CP [seq virus]
MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNL
DITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE
>sequence2 [virus]
MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNL
DITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE
>sequence3
MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNL
DITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE
>sequence4_CP hypothetical protein [another virus]
MLRHSCVMPQQKLKKRFFFLRRLRKILRYFFTCNFLNLFFINREYNIENITLSYLKKERIPVWKTSDMSN
IVRKWWMFHRKTQLEDNIEIKKDIQLYHFFYNGLFIKTNYPYVYHIDKKKKYDFNDMKVIYLPAIHMHSK
>sequence5 hypothetical protein [another virus]
MLRHSCVMPQQKLKKRFFFLRRLRKILRYFFTCNFLNLFFINREYNIENITLSYLKKERIPVWKTSDMSN
IVRKWWMFHRKTQLEDNIEIKKDIQLYHFFYNGLFIKTNYPYVYHIDKKKKYDFNDMKVIYLPAIHMHSK
>sequence6 |hypothetical protein[virus]
MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNLD
ITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE
>sequence7 |hypothetical protein[virus]
MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNLD
ITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE
In this file I would like to remove the duplicated sequences and get:
>sequence1_CP [seq virus]
MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNL
DITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE
>sequence4_CP hypothetical protein [another virus]
MLRHSCVMPQQKLKKRFFFLRRLRKILRYFFTCNFLNLFFINREYNIENITLSYLKKERIPVWKTSDMSN
IVRKWWMFHRKTQLEDNIEIKKDIQLYHFFYNGLFIKTNYPYVYHIDKKKKYDFNDMKVIYLPAIHMHSK
>sequence6 |hypothetical protein[virus]
MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNLD
ITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE
As you can see, the content after the > name line is the same for sequence1_CP, sequence2 and sequence3, so I want to keep only one of the three. But if one of the three sequences has _CP in its name, I want to keep that one in particular. If none of them has _CP in its name, it does not matter which one I keep.
- For the first group of duplicates (sequence1_CP, sequence2 and sequence3), I keep sequence1_CP.
- For the second group of duplicates (sequence4_CP and sequence5), I keep sequence4_CP.
- For the third group of duplicates (sequence6 and sequence7), I keep the first one, sequence6.
Does anyone have an idea using Biopython or a bash method? Thanks a lot.
Recommended answer
You could use this awk one-liner:
$ awk 'BEGIN{FS="\n";RS=""}{if(!seen[$2,$3]++)print}' file
Output:
>sequence1_CP [seq virus]
MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNL
DITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE
>sequence4_CP hypothetical protein [another virus]
MLRHSCVMPQQKLKKRFFFLRRLRKILRYFFTCNFLNLFFINREYNIENITLSYLKKERIPVWKTSDMSN
IVRKWWMFHRKTQLEDNIEIKKDIQLYHFFYNGLFIKTNYPYVYHIDKKKKYDFNDMKVIYLPAIHMHSK
>sequence6 |hypothetical protein[virus]
MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNLD
ITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE
The above relies on the observation that the sequences are ordered so that the _CP entries come before the others, as in the sample. If that is not in fact the case, use the following. It stores the first instance of each sequence, which is overwritten if a _CP sequence is found:
$ awk 'BEGIN{FS="\n";RS=""}{if(!(($2,$3) in seen)||$1~/^[^ ]+_CP /)seen[$2,$3]=$0}END{for(i in seen)print (++j>1?ORS:"") seen[i]}' file
Or pretty-printed:
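$ awk '
BEGIN {
    FS="\n"    # each line of a record is one field
    RS=""      # paragraph mode: records are separated by blank lines
}
{
    # keep the first instance of a sequence; a _CP header overwrites it
    if(!(($2,$3) in seen)||$1~/^[^ ]+_CP /)
        seen[$2,$3]=$0
}
END {
    for(i in seen)
        print (++j>1?ORS:"") seen[i]
}' file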
The output order is whatever awk's for (i in seen) traversal yields, i.e. it appears random.
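If the input order matters, the same "first instance, overwritten by a _CP header" idea can be sketched in plain Python, where a dict keeps its keys in insertion order. This is only an illustrative sketch, not the answer's method: the names parse_fasta, has_cp and dedupe_prefer_cp are hypothetical, the toy records are shortened stand-ins for the real sequences, and the _CP test (first word of the header ends in _CP) is an approximation of the awk regex. It splits records on > header lines, so it does not depend on blank lines between records the way awk's paragraph mode (RS="") does.

```python
def parse_fasta(text):
    """Yield (header, sequence_lines) for each '>' record in text."""
    header, lines = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        if line.startswith(">"):
            if header is not None:
                yield header, lines
            header, lines = line, []
        elif header is not None:
            lines.append(line)
    if header is not None:
        yield header, lines


def has_cp(header):
    """Assumption: a _CP record is one whose first header word ends in _CP."""
    return header.split()[0].endswith("_CP")


def dedupe_prefer_cp(records):
    """Keep one record per sequence, preferring _CP headers, in input order."""
    kept = {}  # key: tuple of sequence lines; overwriting keeps the key's position
    for header, lines in records:
        key = tuple(lines)
        if key not in kept or has_cp(header):
            kept[key] = (header, lines)
    return list(kept.values())


# Toy input (shortened, made-up sequences) where the _CP record comes second.
sample = """\
>sequence2 [virus]
MQCKSG
DITKKF
>sequence1_CP [seq virus]
MQCKSG
DITKKF
>sequence6 |hypothetical protein[virus]
MQCKSGD
ITKKF
"""

for header, lines in dedupe_prefer_cp(parse_fasta(sample)):
    print(header)
# prints:
# >sequence1_CP [seq virus]
# >sequence6 |hypothetical protein[virus]
```

Because overwriting an existing dict key does not move it, each kept record appears at the position where its sequence was first seen.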
Update: if BOTH of @kvantour's comments are valid in this case, use this awk:
$ awk '
BEGIN {
    FS="\n"
    RS=""
}
{
    # build the key k by joining all sequence lines (fields 2..NF)
    for(i=2;i<=NF;i++)
        k=(i==2?"":k) $i
    if(!(k in seen)||$1~/^[^ ]+_CP /)
        seen[k]=$0
}
END {
    for(i in seen)
        print (++j>1?ORS:"") seen[i]
}' file
Output now:
>sequence1_CP [seq virus]
MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNL
DITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE
>sequence4_CP hypothetical protein [another virus]
MLRHSCVMPQQKLKKRFFFLRRLRKILRYFFTCNFLNLFFINREYNIENITLSYLKKERIPVWKTSDMSN
IVRKWWMFHRKTQLEDNIEIKKDIQLYHFFYNGLFIKTNYPYVYHIDKKKKYDFNDMKVIYLPAIHMHSK
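Only two records remain here because the update keys on the whole joined sequence rather than on the individual lines. A small Python check, using the sequence lines from the sample file, shows that sequence1 and sequence6 differ only in where the line break falls. The helper name sequence_key is hypothetical and simply mirrors the concatenation of fields 2..NF in the awk script.

```python
def sequence_key(lines):
    """Join wrapped sequence lines into one string, ignoring line breaks."""
    return "".join(line.strip() for line in lines)


# Sequence lines of sequence1_CP and sequence6, copied from the sample file.
seq1 = [
    "MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNL",
    "DITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE",
]
seq6 = [
    "MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNLD",
    "ITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE",
]

print(seq1 == seq6)                              # False: the lines are wrapped differently
print(sequence_key(seq1) == sequence_key(seq6))  # True: same residues once joined
```

With line-based keys (the first two scripts) sequence6 counts as a distinct record; with the joined key it collapses into the sequence1_CP group.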