问题描述
我有两个文件
基因列表文件如下
LOC_Os06g12230.1
Pavir.Ab03005
Pavir.J14065
ChrUn.fgenesh
Sevir.1G325700
LOC_Os02g51280.1
Bradi3g59320
Brast04G017400
Fasta序列文件如下所示
Fasta sequence file looks like this
>LOC_Os03g57190.1 pacid=33130570 polypeptide=LOC_Os03g57190.1 locus=LOC_Os03g57190 ID=LOC_Os03g57190.1.MSUv7.0 annot-version=v7.0
ATGGAGGCGGCGGTGGGGGACGGGGAAGGCGGTGGCGGCGGCGGCGGGCGGGGGAAGCGTGGGCGGGGAGGAGGAGGAGG
GGAGATGGTGGAGGCGGTGTGGGGGCAGACGGGGAGTACGGCGTCGCGGATCTACAGGGTGAGGGCGACGGGGGGGAAGG
ACAGGCACAGCAAGGTGTACACGGCGAAGGGAATCCGCGACCGCCGCGTCCGCCTCTCCGTCGCCACCGCCATCCAGTTC
TACGACCTCCAGGACCGCCTCGGCTTCGACCAGCCGAGCAAGGCCATCGAGTGG
>LOC_Os02g51280.1 pacid=33134358 polypeptide=LOC_Os02g51280.1 locus=LOC_Os02g51280 ID=LOC_Os02g51280.1.MSUv7.0 annot-version=v7.0
ATGACCATGGACGTCGCCGGAGACGCCGGAGGTGGCCGCCGCCCAAACTTCCCCTTGCAGCTTCTTGAGAAGAAGGAGGA
CGGGCGGTGCCGGAGGGGAGATGCAGCTGCGGAAGGCGGCGCCGAAGCGGAGCTCCACCAAGGACCGGCACACCAAGGTG
GAAGGGAGGGGGCGGCGCATCCGGATGCCGGCGCTGTGCGCGGCGAGGGTGTTCCAGCTGACGCGGGAGCTGG
>LOC_Os06g12230.1 pacid=33145596 polypeptide=LOC_Os06g12230.1 locus=LOC_Os06g12230 ID=LOC_Os06g12230.1.MSUv7.0 annot-version=v7.0
ATGGATGTCACCGGAGACGGCGGAGGAGGAGGGCAACGGCCCAATTTCCCCCTGCAGCTCCTCGGGAAGAAGGAGGAGCA
GACGTGCTCGACGTCGCAGACTGCCGGGGCGGGCGGCGGCGGCGTCGTGGGCGCGAATGGGTCGGCGGCGGCGGCGCCGC
CGAAGCGGACGTCGACGAAGGACCGGCACACGAAGGTGGACGGGCGGGGGCGGCGCATCCGGATGCCGGCGATCTGCGCC
GCGCGGGTGTTCCAGCTGACGCGGGAGCTCGGGCACAAGACCGACGGCGA
>LOC_Os05g43760.1 pacid=33158388 polypeptide=LOC_Os05g43760.1 locus=LOC_Os05g43760 ID=LOC_Os05g43760.1.MSUv7.0 annot-version=v7.0
ATGACAAGCAATAACAGCACGAATGAGGAGCTCGGCGGCGGCGGCAGGAAGGCGGCCGACAAGCCGAGCGGCGGCGGCGG
CGCCGCCGCCGCCGTGGCGAGCTCGCGGCACTGGTCGGCGTCGACGGAGTCGCGGATCGTGCGCGTGTCGAGGGTGTTCG
GCGGCAAGGACCGTCACAGCAAGGTGAGGACGGTGAAGGGGCTCCGCGACCGGCGGGTGCGGCTGTCGGTGCCGACGGCG
ATCCAGCTCTACGACCTGCAGGACCGGCTGGGGCTCAGCCAGCCGAGCAAGGTGGTCGACT
如果基因名称和标题行匹配,则必须将序列提取到新文件中
if the gene name and header line matches then, sequence has to be pulled out into new file
新文件应包含
>LOC_Os02g51280.1 pacid=33134358 polypeptide=LOC_Os02g51280.1 locus=LOC_Os02g51280 ID=LOC_Os02g51280.1.MSUv7.0 annot-version=v7.0
ATGACCATGGACGTCGCCGGAGACGCCGGAGGTGGCCGCCGCCCAAACTTCCCCTTGCAGCTTCTTGAGAAGAAGGAGGA
CGGGCGGTGCCGGAGGGGAGATGCAGCTGCGGAAGGCGGCGCCGAAGCGGAGCTCCACCAAGGACCGGCACACCAAGGTG
GAAGGGAGGGGGCGGCGCATCCGGATGCCGGCGCTGTGCGCGGCGAGGGTGTTCCAGCTGACGCGGGAGCTGG
>LOC_Os06g12230.1 pacid=33145596 polypeptide=LOC_Os06g12230.1 locus=LOC_Os06g12230 ID=LOC_Os06g12230.1.MSUv7.0 annot-version=v7.0
ATGGATGTCACCGGAGACGGCGGAGGAGGAGGGCAACGGCCCAATTTCCCCCTGCAGCTCCTCGGGAAGAAGGAGGAGCA
GACGTGCTCGACGTCGCAGACTGCCGGGGCGGGCGGCGGCGGCGTCGTGGGCGCGAATGGGTCGGCGGCGGCGGCGCCGC
CGAAGCGGACGTCGACGAAGGACCGGCACACGAAGGTGGACGGGCGGGGGCGGCGCATCCGGATGCCGGCGATCTGCGCC
GCGCGGGTGTTCCAGCTGACGCGGGAGCTCGGGCACAAGACCGACGGCGA
我已经尝试过了
grep -f genelist.txt -A3 fastafile.txt >> newfasta.txt
但是不同的fasta序列具有不同的长度,
but different fasta sequences have different lengths,
模式匹配后,我要选择直到下一个'>'符号出现
After pattern match, i want to pick till next '>' symbol appears
推荐答案
使用awk处理FASTA文件的最简单方法是建立一个名为name
的变量和一个名为seq
的变量.每次阅读完整序列时,都可以对其进行处理.请注意,为了最好的处理方式,应将序列存储为连续字符串,并且不应包含任何换行符或空格.用于处理Fasta的通用awk如下所示:
The easiest way to process FASTA files with awk, is to build up a variable called name
and a variable called seq
. Every time you read a full sequence, you can process it. Remark that, for the best way of processing, the sequence, should be stored as a continues string, and not contain any newlines or whitespaces due. A generic awk for processing fasta, looks like this:
awk '/^>/ && seq { process_sequence_here }
/^>/{name=$0; seq=""; next}
{seq = seq $0 }
END { process_sequence_here }' file.fasta
您可以通过引入几个函数来简化此过程:
You can make this a bit easier by introducing a couple of functions:
awk '/^>/ && seq { process_sequence(name_seq) }
/^>/{name=substr($0,2); seq=""; next}
{seq = seq $0 }
END { process_sequence(name,seq) }
BEGIN{seq_ere=sprintf("%80s","");gsub(" ",".",seq_ere) }
function print_sequence(name,seq) {
gsub(seq_ere,"&" ORS, seq); print ">" name ORS seq
}
function process_sequence(name,seq) { ... }
' file.fasta
对于OP,以上内容将显示为:
In case of the OP, the above would read:
awk '(NR==FNR) { a[$0]; next }
/^>/ && seq { process_sequence(name_seq) }
/^>/{name=substr($0,2); seq=""; next}
{seq = seq $0 }
END { process_sequence(name,seq) }
BEGIN{seq_ere=sprintf("%80s","");gsub(" ",".",seq_ere) }
function print_sequence(name,seq) {
gsub(seq_ere,"&" ORS, seq); print ">" name ORS seq
}
function process_sequence(name,seq) {
$0=name; if ($1 in a) print_sequence (name,seq)
}
' list.txt file.fasta
使用awk处理fasta文件时,您始终可以考虑使用 bioawk .它具有POSIX awk的所有功能,但经过增强后可以轻松处理FASTA文件:
When you process fasta-files with awk, you can always concider to use bioawk. It has all the bells-and-whistles from POSIX awk, but is augmented to easily process FASTA files:
这篇关于如何从基因列表中选择多个Fasta序列的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!