给定第二个包含序列名称的文件，使用AWK搜索fasta文件

本文介绍了给定第二个包含序列名称的文件，使用AWK搜索fasta文件的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我有2个文件.一个是fasta文件，其中包含多个fasta序列，而另一个文件中包含我要搜索的候选序列的名称(下面的文件示例).

I have a 2 files. One is a fasta file contain multiple fasta sequences, while another file includes the names of candidate sequences I want to search (file Example below).

seq.fasta

>Clone_18
GTTACGGGGGACACATTTTCCCTTCCAATGCTGCTTTCAGTGATAAATTGAGCATGATGGATGCTGATAATATCATTCCCGTGT
>Clone_23
GTTACGGGGGGCCGAAAAACACCCAATCTCTCTCTCGCTGAAACCCTACCTGTAATTTGCCTCCGATAGCCTTCCCCGGTGA
>Clone_27-1
GTTACGGGGACCACACCCTCACACATACAAACACAAACACTTCAAGTGACTTAGTGTGTTTCAGCAAAACATGGCTTC
>Clone_27-2
GTTACGGGGACCACACCCTCACACATACAAACACAAACACTTCAAGTGACTTAGTGTGTTTCAGCAAAACATGGCTTCGTTTTGTTCTAGATTAACTATCAGTTTGGTTCTGTTTGTCCTCGTACTGGGTTGTGTCAATGCACAACTT
>Clone_34-1
GTTACGGGGGAATAACAAAACTCACCAACTAACAACTAACTACTACTTCACTTTTCAACTACTTTACTACAATACTAAGAATGAAAACCATTCTCCTCATTATCTTTGCTCTCGCTCTTTTCACAAGAGCTCAAGTCCCTGGCTACCAAGCCATCG
>Clone_34-3
GTTACGGGGGAATAACAAAACTCACCAACTAACAACTAACTACTACTTCACTTTTCAACTACTTTACTACAATACTAAGAATGAAAACCATTCTCCTCATTATCTTTGCTCTCGCTCTTTTCACAAGAGCTCAAGTCCCTGGCTACCAAGCCATCGATATCGCTGAAGCCCAATC
>Clone_44-1
GTTACGGGGGAATCCGAATTCACAGATTCAATTACACCCTAAAATCTATCTTCTCTACTTTCCCTCTCTCCATTCTCTCTCACACACTGTCACACACATCC
>Clone_44-3
GTTACGGGGGAATCCGAATTCACAGATTCAATTACACCCTAAAATCTATCTTCTCTACTTTCCCTCTCTCCATTCTCTCTCACACACTGTCACACACATCCCGGCAGCGCAGCCGTCGTCTCTACCCTTCACCAGGAATAAGTTTATTTTTCTACTTAC

name.txt

Clone_23
Clone_27-1

我想使用AWK搜索fasta文件，并获取其名称保存在另一个文件中的给定候选者的所有fasta序列.

I want to use AWK to search through the fasta file, and obtain all the fasta sequences for given candidates whose names were saved in another file.

awk 'NR==FNR{a[$1]=$1} BEGIN{RS="\n>"; FS="\n"} NR>FNR {if (match($1,">")) {sub(">","",$1)} for (p in a) {if ($1==p) print ">"$0}}' name.txt seq.fasta

问题是我只能这样提取name.txt中第一个候选者的序列

The problem is that I can only extract the sequence of first candidate in name.txt, like this

>Clone_23
GTTACGGGGGGCCGAAAAACACCCAATCTCTCTCTCGCTGAAACCCTACCTGTAATTTGCCTCCGATAGCCTTCCCCGGTGA

任何人都可以帮助修复上面的单行awk命令吗?

Can anyone help to fix one-line awk command above?

推荐答案

如果可以，甚至还可以打印名称，则可以简单地使用grep:

If it is ok or even desired to print the name as well, you can simply use grep:

grep -Ff name.txt -A1 a.fasta

-f name.txt从name.txt
-F将它们视为文字字符串而不是正则表达式
A1打印匹配的行以及下一行

-f name.txt picks patterns from name.txt
-F treats them as literal strings rather than regular expressions
A1 prints the matching line plus the subsequent line

如果在输出中不需要名称，我将简单地通过管道传送到另一个grep:

If the names are not desired in output I would simply pipe to another grep:

above_command | grep -v '>'

一个awk解决方案可以如下所示:

An awk solution can look like this:

awk 'NR==FNR{n[$0];next} substr($0,2) in n && getline' name.txt a.fasta

在多行版本中进行更好的解释:

Better explained in a multiline version:

# True as long as we are reading the first file, name.txt
NR==FNR {
    # Store the names in the array 'n'
    n[$0]
    next
}

# I use substr() to remove the leading `>` and check if the remaining
# string which is the name is a key of `n`. getline retrieves the next line
# If it succeeds the condition becomes true and awk will print that line
substr($0,2) in n && getline

这篇关于给定第二个包含序列名称的文件，使用AWK搜索fasta文件的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持！