Problem description
I have data that always comes in blocks of four lines, in the following format (called FASTQ):
@SRR018006.2016 GA2:6:1:20:650 length=36
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGN
+SRR018006.2016 GA2:6:1:20:650 length=36
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!+!
@SRR018006.19405469 GA2:6:100:1793:611 length=36
ACCCGCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
+SRR018006.19405469 GA2:6:100:1793:611 length=36
7);;).;);;/;*.2>/@@7;@77<..;)58)5/>/
Is there a simple sed/awk/bash way to convert it into this format (called FASTA):
>SRR018006.2016 GA2:6:1:20:650 length=36
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGN
>SRR018006.19405469 GA2:6:100:1793:611 length=36
ACCCGCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
In principle, we want to extract the first two lines of each block of four and replace the leading @ with >.
Recommended answer
This is an old question, and many different solutions have been offered. The accepted answer uses sed but has a glaring problem: it also replaces @ with > when the @ sign appears as the first character of a quality line. I therefore feel compelled to offer a simple sed-based solution that actually works:
sed -n '1~4s/^@/>/p;2~4p'
The only assumption is that each read occupies exactly 4 lines in the FASTQ file, which seems pretty safe in my experience.
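The 1~4 and 2~4 addresses in the sed command are a GNU sed extension, so other sed implementations may reject them. As a rough sketch of the same idea in awk, under the same 4-lines-per-read assumption, something like the following could be used. The function name fastq_to_fasta_awk is made up for illustration. Unlike the sed command, it prints every header line even if it does not start with @.

```shell
# Hypothetical helper: reads FASTQ on stdin, writes FASTA on stdout.
# Assumes every read occupies exactly 4 lines.
fastq_to_fasta_awk() {
  awk '
    NR % 4 == 1 { sub(/^@/, ">"); print }  # header line: swap a leading @ for >
    NR % 4 == 2 { print }                  # sequence line: copy unchanged
  '
}
```

With placeholder file names, this would be run as fastq_to_fasta_awk < reads.fastq > reads.fasta. Lines 3 and 4 of each block are never examined, so a quality line that starts with @ is not mistaken for a header.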
The fastq_to_fasta script in the FASTX-Toolkit also works. (It's worth mentioning that you need to specify the -Q33 option to accommodate the now-common Phred+33 quality encoding, which is funny, since it throws away the quality data anyway!)