



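  • MHAP self-correction (FALCON also uses MHAP; HGAP in SMRT uses a different, slower self-correction algorithm; the core of self-correction is multiple sequence alignment)
  • CCS
  • Quiver, Arrow
  • Sparc, PBdagcon
  • PacBioToCA (I misspoke earlier: this is a hybrid algorithm that corrects third-generation reads with second-generation reads)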
  • Proovread
  • LoRDEC
  • Ectools
  • LSC



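  • What does PBcR stand for? PacBio Corrected Reads (PBcR) pipeline.
  • Who developed PBcR? Not PacBio itself, but the Center for Bioinformatics and Computational Biology at the University of Maryland.
  • What does the PBcR program look like? PBcR is just a name; as noted above, it is only a pipeline. The program's full name is Whole-Genome Shotgun Assembler, abbreviated wgs-8.3rc2. It consists entirely of Perl scripts, most of which call other subprograms (SAMtools, Jellyfish, pbutgcns, pbdagcon, BLASR, FALCON, etc.).
  • How do you invoke PacBioToCA? PacBioToCA is just one function within PBcR; run "./PBcR" with the appropriate parameters to use it (see below).
  • What is the difference between PBcR, HGAP, and SMRT? Between PBcR, CA, and Canu? Celera Assembler, Canu, and PBcR can basically be regarded as different versions of the same thing; HGAP is PacBio's official assembly pipeline; SMRT is short for single-molecule real-time sequencing.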

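PacBioToCA is a subprogram within PBcR dedicated to correcting third-generation reads with second-generation reads. As the name suggests, PacBioToCA converts PB reads into a format that CA can use for the subsequent assembly.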


source /ifs4/BC_PUB/biosoft/pipeline/DNA/DNA_Denovo/PacBio/WGS-8.3/wgs-.3rc2/Linux-amd64/bin/sampleData/setup.sh
/ifs4/BC_PUB/biosoft/pipeline/DNA/DNA_Denovo/PacBio/WGS-8.3/wgs-.3rc2/Linux-amd64/bin/PBcR_V2 -sensitive -length -partitions -l chr22_60X -s ./pacbio.SGE.spec -threads -genomeSize -maxGap -fastq /ifs4/BC_RD/USER/lizhixin/my_project/PacBio_reads/PB_chr22.fastq

pacBioToCA github

/tools/wgs-7.0/Linux-amd64/bin/fastqToCA -libraryname illumina -technology illumina -reads illumina.fastq > illumina.frg
/tools/wgs-7.0/Linux-amd64/bin/pacBioToCA -length -partitions -l ec_pacbio -t -s pacbio.spec \
-fastq pacbio.filtered_subreads.fastq illumina.frg > run.out >&

PBcR, University of Maryland

PBcR SourceForge

pacBioToCA wiki (three papers and other links)

PacBio sequence error correction and assembly via pacBioToCA

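Self-correction requirements: at present a minimum of about 15X coverage appears necessary; below that, self-correction does not work. Self-correction seems able to resolve chimeras, but its main drawback is that the PB reads become shorter and the data volume drops by half.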



MHAP - MinHash Alignment Process (MHAP, pronounced MAP): locality-sensitive hashing to detect long-read overlaps and utilities

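MinHash is a form of LSH (locality-sensitive hashing) that can quickly estimate the similarity of two sets.

In PBcR, MinHash is used to find overlaps quickly. It serves the same purpose as DALIGNER, but the underlying algorithm is different. Clearly neither tool performs direct all-against-all pairwise alignment, because that complexity would be unacceptable.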
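To make the idea concrete, here is a minimal Python sketch of MinHash over k-mer sets. It is not MHAP's actual implementation: the function names, the k-mer size, the number of hash functions, and the use of salted BLAKE2b as the hash family are all assumptions chosen for illustration. The fraction of sketch positions where two reads agree estimates the Jaccard similarity of their k-mer sets, so likely overlaps can be flagged without aligning every pair.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def minhash_sketch(seq, k=5, num_hashes=64):
    """Keep, for each of num_hashes salted hash functions, the minimum hash
    value over all k-mers of seq (k and num_hashes are illustrative)."""
    kmer_set = kmers(seq, k)
    if not kmer_set:
        raise ValueError("sequence is shorter than k")
    sketch = []
    for seed in range(num_hashes):
        salt = str(seed).encode()  # a different salt acts as a different hash function
        sketch.append(min(
            int.from_bytes(
                hashlib.blake2b(km.encode(), digest_size=8, salt=salt).digest(),
                "big",
            )
            for km in kmer_set
        ))
    return sketch


def estimate_jaccard(sketch_a, sketch_b):
    """Fraction of positions where the two sketches hold the same minimum."""
    if len(sketch_a) != len(sketch_b):
        raise ValueError("sketches must have the same length")
    matches = sum(a == b for a, b in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)
```

Comparing two fixed-size sketches costs the same regardless of read length, which is why this kind of filter scales to all-against-all overlap detection.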


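Complete-Striped-Smith-Waterman-Library (Impressive! Someone implemented and packaged this long ago, so most programming today only requires combining and wrapping other people's programs.)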



HGAP and PBcR self-correction





Circular consensus sequencing

PacificBiosciences/unanimity - Consensus library and applications (this software can perform CCS directly)




ccs takes multiple reads of the same SMRTbell sequence and combines them, employing a statistical model, to produce one high quality consensus sequence.
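As a rough illustration of why multiple passes over the same molecule help, the following Python sketch takes a per-column majority vote over passes that are assumed to be already aligned to the same length, with "-" marking a gap. This is a deliberately simplified stand-in: the real ccs tool uses a statistical model rather than a vote, and the function name and input format here are hypothetical.

```python
from collections import Counter


def majority_consensus(aligned_passes):
    """Per-column majority vote over pre-aligned passes of one molecule.

    Assumes every pass has the same length and uses '-' for gaps.
    Columns whose most common symbol is a gap are dropped.
    """
    if not aligned_passes:
        raise ValueError("need at least one pass")
    length = len(aligned_passes[0])
    if any(len(p) != length for p in aligned_passes):
        raise ValueError("passes must be aligned to the same length")
    consensus = []
    for column in zip(*aligned_passes):
        base, _count = Counter(column).most_common(1)[0]
        if base != "-":
            consensus.append(base)
    return "".join(consensus)


# Five noisy passes; each error appears in only one pass, so the vote removes it.
passes = ["ACGTAC", "ACCTAC", "ACGT-C", "ACGTAC", "A-GTAC"]
print(majority_consensus(passes))  # prints ACGTAC
```

Because the errors in each pass are largely independent, the consensus improves as the number of passes grows.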


3.Quiver & Arrow

PacificBiosciences/GenomicConsensus: the tool to use

bax2bam -f test.fofn -o subreads2
pbalign subreads.bam ref.fa mapped.bam
source setup.sh
samtools faidx ref.fa
arrow --algorithm=arrow -v -j8 mapped.bam -r ref.fa -o ref.arrowed.fq

Quiver is the legacy consensus model based on a conditional random field approach.(HGAP final "assembly polishing" step)

Arrow is an improved consensus model based on a more straightforward hidden Markov model approach.


4.Sparc & PBdagcon

PacificBiosciences/pbdagcon  -  A sequence consensus algorithm implementation based on using directed acyclic graphs to encode multiple sequence alignment


source setup.sh
blasr query.fa ref.fa -bestn -m -out mapped.m5
Sparc m mapped.m5 b ref.fa k c g t 0.2 o consensus.fa


proovread – github

Publication: proovread: large-scale high-accuracy PacBio correction through iterative short read consensus (2014)

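This software improves on PacBioToCA and LSC: PacBioToCA loses more than 40% of the data, requires CA to be installed, and runs on a cluster, while LSC was developed mainly for correcting human transcriptome reads.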

a new SMRT sequencing correction pipeline:

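  • Runs on both ordinary computers and clusters
  • Applicable to different settings (genome, transcriptome)
  • No loss of accuracy, length, or data volume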


The implementation of the theoretical model strongly depends on the used mapping software.

As default, proovread uses SHRiMP2 (David et al., 2011) for mapping. Its versatile interface allowed us to completely implement the hybrid scoring model with the following parameters: insertions are the most frequent errors and are penalized as gap open with –1. Deletions occur about half as often and are thus penalized with –2. Extensions for insertions and deletions are scored with –3 and –4, respectively. Mismatches are at least 10 times as rare, resulting in a penalty of –11 (Supplementary Table S1). All results presented here have been generated using these settings with SHRiMP2 version 2.2.3.
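The penalties quoted above can be tried out on a toy alignment with the Python sketch below. Several things here are assumptions, not taken from the paper excerpt: the operation string format (M match, X mismatch, I insertion, D deletion), the convention that a gap of length n costs open + n × extend, and the match reward, which the excerpt does not state and which is therefore left as a required argument.

```python
from itertools import groupby

# Penalties as quoted from the proovread paper.
INS_OPEN, DEL_OPEN = -1, -2
INS_EXTEND, DEL_EXTEND = -3, -4
MISMATCH = -11


def score_alignment(ops, match_score):
    """Score an alignment given as a string of M/X/I/D operations.

    Assumption: a gap run of length n costs open + n * extend.
    match_score is not given in the excerpt, so the caller must choose it.
    """
    total = 0
    for op, run in groupby(ops):
        n = sum(1 for _ in run)
        if op == "M":
            total += n * match_score
        elif op == "X":
            total += n * MISMATCH
        elif op == "I":
            total += INS_OPEN + n * INS_EXTEND
        elif op == "D":
            total += DEL_OPEN + n * DEL_EXTEND
        else:
            raise ValueError(f"unknown operation: {op!r}")
    return total
```

With any reasonable match reward, a single mismatch costs more than a short insertion or deletion, which mirrors the PacBio error profile described in the quote: indels are common, substitutions are rare.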

SHRiMP - SHort Read Mapping Package, software homepage

SHRiMP2 usage instructions



LoRDEC: a hybrid error correction program for long, PacBio reads

Publication: LoRDEC: accurate and efficient long read error correction (2014)

We present LoRDEC, a hybrid error correction method that builds a succinct de Bruijn graph representing the short reads, and seeks a corrective sequence for each erroneous region in the long reads by traversing chosen paths in the graph. In comparison, LoRDEC is at least six times faster and requires at least 93% less memory or disk space than available tools, while achieving comparable accuracy.
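The graph-traversal idea can be illustrated with a toy Python sketch. It builds the k-mer set of the short reads (an implicit de Bruijn graph) and searches breadth-first for a path between two k-mers that flank an erroneous region of a long read. This is a simplification under stated assumptions: the function names are hypothetical, no abundance threshold is applied to select trusted k-mers, and the shortest path is returned, whereas LoRDEC itself chooses among paths by comparing them with the long read.

```python
from collections import deque


def build_kmer_set(short_reads, k):
    """Collect every k-mer in the short reads (nodes of the de Bruijn graph)."""
    return {
        read[i:i + k]
        for read in short_reads
        for i in range(len(read) - k + 1)
    }


def bridge(kmer_set, source, target, max_steps):
    """Breadth-first search from source to target through the k-mer graph.

    Returns the sequence spelled by the shortest path, or None if target
    is not reachable within max_steps single-base extensions.
    """
    queue = deque([(source, source)])  # (current k-mer, sequence spelled so far)
    seen = {source}
    while queue:
        node, spelled = queue.popleft()
        if node == target:
            return spelled
        if len(spelled) - len(source) >= max_steps:
            continue
        for base in "ACGT":
            successor = node[1:] + base
            if successor in kmer_set and successor not in seen:
                seen.add(successor)
                queue.append((successor, spelled + base))
    return None


short_reads = ["ACGTTGCA", "TTGCATGGA"]
graph = build_kmer_set(short_reads, k=4)
# The long read is trusted at ACGT and TGGA; the region in between is replaced.
print(bridge(graph, "ACGT", "TGGA", max_steps=20))  # prints ACGTTGCATGGA
```

Storing only the k-mer set rather than read-to-read alignments is what keeps memory use low in this family of methods.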



GitHub source: Ectools - tools for error correction and working with long read data (with detailed step-by-step instructions)


Publication: Error correction and assembly complexity of single molecule sequencing reads (2014)


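Overall, this correction algorithm uses unitigs (assembled from second-generation reads) to correct third-generation long reads.

  1. Assemble the second-generation reads into unitigs; output organism.utg.fasta
  2. Create a working directory: mkdir organism_correct; create a symlink: ln -s /path/to/organism.utg.fasta
  3. Filter out long reads shorter than 1 kb, making sure more than 20X of data remains for correction
  4. Partition the long reads: python ${ECTOOLS_HOME}/partition.py 20 500 pbreads.length_filtered.fa
  5. Copy the correction script into the working directory: cp ${ECTOOLS_HOME}/correct.sh .; then edit the global variables in correct.sh
  6. Install nucmer
  7. Run the partitions one by one: for i in {0001..000N}; do cd $i; qsub -cwd -j y -t 1:${NUM_FILES_PER_PARTITION} ../correct.sh; cd ..; done
  8. When the jobs finish, merge the corrected output: cat ????/*.cor.fa > organism.cor.fa
  9. (Optional) Use convert-fasta-to-v2.pl to make a Celera frg file from organism.cor.fa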
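Step 3 (length filtering plus a coverage check) can be sketched in Python as follows. This is not part of Ectools: the helper names are hypothetical, the genome size must be supplied by the user, and the 1 kb cutoff simply follows the step above.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def filter_by_length(records, min_len=1000):
    """Keep only reads of at least min_len bases."""
    return [(name, seq) for name, seq in records if len(seq) >= min_len]


def coverage(records, genome_size):
    """Total bases in the records divided by the genome size."""
    return sum(len(seq) for _name, seq in records) / genome_size


def write_fasta(records, handle):
    """Write records back out, one sequence per line."""
    for name, seq in records:
        handle.write(f">{name}\n{seq}\n")
```

A typical use would be to open the raw long-read FASTA, pass the file object to read_fasta, filter, and check that coverage(kept, genome_size) is still above 20 before writing the filtered file for partitioning.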

From this, we develop a new data-driven model using support vector regression that can accurately predict assembly performance. We also present a novel hybrid error correction algorithm for long PacBio sequencing reads that uses pre-assembled Illumina sequences for the error correction.



LSC - a long read error correction tool(for RNA-Seq)

Publication: Improving PacBio Long Read Accuracy by Short Read Alignment

LSC applies a homopolymer compression (HC) transformation strategy to increase the sensitivity of SR-LR alignment without sacrificing alignment accuracy.
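Homopolymer compression itself is simple to demonstrate. The Python sketch below collapses each run of identical bases to a single base and keeps the run lengths so the original can be restored; the function names are hypothetical and this is not LSC's code, only an illustration of the transformation.

```python
from itertools import groupby


def homopolymer_compress(seq):
    """Collapse each run of identical bases to one base.

    Returns the compressed sequence and the list of original run lengths.
    """
    bases, run_lengths = [], []
    for base, run in groupby(seq):
        bases.append(base)
        run_lengths.append(sum(1 for _ in run))
    return "".join(bases), run_lengths


def homopolymer_decompress(compressed, run_lengths):
    """Rebuild the original sequence from the compressed form."""
    return "".join(base * n for base, n in zip(compressed, run_lengths))


print(homopolymer_compress("AAACCGTTTT"))  # prints ('ACGT', [3, 2, 1, 4])
```

Two reads that differ only in homopolymer length, such as AAACCGT and AACCCGT, compress to the same string, which is why short reads map more sensitively onto compressed long reads.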


A Biostars thread worth reading closely: Question: What tools you use or know for PacBio Long Read error correction?



Pacific Biosciences – PB's GitHub home     Bioinformatics Workshop - PB workflow

chimera formation

PacBio RS - Zhihu highlights

Identify adapter sequences in pacbio reads    BBMap - BBMap short read aligner, and other bioinformatic tools.

BBTools - official site

SEQanswers PacBio forum: http://seqanswers.com/forums/archive/index.php/f-39.html


