Near-optimal RNA-Seq quantification https://pachterlab.github.io/kallisto

Input and output file descriptions: http://bio.math.berkeley.edu/eXpress/manual.html


Paper title:

Pseudoalignment for metagenomic read assignment

Abstract:

We explore connections between metagenomic read assignment and the quantification of transcripts from RNA-Seq data. In particular, we show that the recent idea of pseudoalignment introduced in the RNA-Seq context is suitable in the metagenomics setting. When coupled with the Expectation-Maximization (EM) algorithm, reads can be assigned far more accurately and quickly than is currently possible with state of the art software.

Paper URL:

https://arxiv.org/abs/1510.07371v2

Source code:

https://pachterlab.github.io/kallisto/about

Installation:

wget https://github.com/pachterlab/kallisto/releases/download/v0.43.0/kallisto_linux-v0.43.0.tar.gz

Test:

$ /project/metagenomics_benchmark/kallisto_linux-v0.43.0/kallisto index -i transcripts.idx transcripts.fasta

$ /project/metagenomics_benchmark/kallisto_linux-v0.43.0/kallisto quant -i transcripts.idx -o output reads_1.fastq reads_2.fastq    # input files

$ more output/abundance.tsv

target_id       length  eff_length  est_counts  tpm
NM_001168316    2283    2105.9      160.606     12581
NM_174914       2385    2207.9      1500.72     112128
NR_031764       1853    1675.9      102.671     10106.2
NM_004503       1681    1503.9      331.118     36320.7
NM_006897       1541    1363.9      664         80311.3
NM_014212       2037    1859.9      55          4878.25
NM_014620       2300    2122.9      591.166     45937.9
NM_017409       1959    1781.9      47          4351.17
NM_017410       2396    2218.9      42          3122.5
NM_018953       1612    1434.9      227.999     26212.1
NM_022658       2288    2110.9      4881        381446
NM_153633       1666    1488.9      361.044     40002.4
NM_153693       2072    1894.9      73.6719     6413.67
NM_173860       849     671.903     962         236189
NR_003084       1640    1462.9      0.00164208  0.18517
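The tpm column in the listing above can be recomputed from est_counts and eff_length: each target's count is divided by its effective length, and the resulting rates are scaled to sum to one million. The following is a minimal sketch of that relationship, not kallisto's own code; the function name tpm_from_counts is made up for illustration, and the rows are copied from the listing.

```python
# Columns copied from the abundance.tsv listing: target_id, eff_length, est_counts
ROWS = """\
NM_001168316 2105.9 160.606
NM_174914 2207.9 1500.72
NR_031764 1675.9 102.671
NM_004503 1503.9 331.118
NM_006897 1363.9 664
NM_014212 1859.9 55
NM_014620 2122.9 591.166
NM_017409 1781.9 47
NM_017410 2218.9 42
NM_018953 1434.9 227.999
NM_022658 2110.9 4881
NM_153633 1488.9 361.044
NM_153693 1894.9 73.6719
NM_173860 671.903 962
NR_003084 1462.9 0.00164208
"""


def tpm_from_counts(rows):
    """Return {target_id: tpm} from (target_id, eff_length, est_counts) tuples."""
    # Reads per effective base for each target
    rates = {tid: counts / eff_len for tid, eff_len, counts in rows}
    total = sum(rates.values())
    # Scale so that all values sum to one million
    return {tid: rate / total * 1e6 for tid, rate in rates.items()}


rows = []
for line in ROWS.splitlines():
    tid, eff_len, counts = line.split()
    rows.append((tid, float(eff_len), float(counts)))

tpm = tpm_from_counts(rows)
for tid, value in tpm.items():
    print(f"{tid}\t{value:.6g}")
```

Because the listing shows rounded values, the recomputed numbers should agree with the tpm column only up to rounding.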

Usage:

kallisto

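kallisto is a program for quantifying the abundance of transcripts from RNA-Seq data or, more generally, of target sequences using high-throughput sequencing reads. It is based on the novel idea of pseudoalignment, which rapidly determines which targets a read is compatible with, without the need for alignment. On standard RNA-Seq data, kallisto can build an index in less than ten minutes and quantify 30 million human reads in less than three minutes on a Mac. Pseudoalignment preserves the key information needed for quantification, so kallisto is not only fast but also comparable in accuracy to existing quantification tools. In fact, because the pseudoalignment procedure is robust to errors in the reads, kallisto significantly outperforms existing tools in many benchmarks.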

RNA-Seq data quantified with kallisto can be analyzed with sleuth.
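To make the idea of pseudoalignment more concrete, here is a toy sketch: index every k-mer of every target, then intersect the target sets of the k-mers in a read. This is a deliberate simplification and not how kallisto is implemented (kallisto works on a de Bruijn graph of the transcriptome, and this sketch ignores reverse complements). The sequences and function names are hypothetical.

```python
from collections import defaultdict


def build_kmer_index(targets, k):
    """Map each k-mer to the set of target names that contain it."""
    index = defaultdict(set)
    for name, seq in targets.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index


def pseudoalign(read, index, k):
    """Return the set of targets compatible with every k-mer of the read."""
    compatible = None
    for i in range(len(read) - k + 1):
        hits = index.get(read[i:i + k], set())
        # Intersect with the targets seen so far
        compatible = set(hits) if compatible is None else compatible & hits
        if not compatible:
            break  # no target can contain all k-mers of this read
    return compatible or set()


# Hypothetical toy transcripts that share a common prefix
targets = {
    "t1": "ACGTACGTTAGC",
    "t2": "ACGTACGTTCCA",
}
index = build_kmer_index(targets, k=5)

shared = pseudoalign("ACGTACGT", index, k=5)  # lies in the shared prefix
unique = pseudoalign("CGTTAGC", index, k=5)   # reaches the t1-specific suffix
print(sorted(shared), sorted(unique))
```

No base-level alignment is computed at any point; the result is only a set of compatible targets. In this sketch a k-mer missing from the index makes the read incompatible with every target, which may differ from how kallisto treats such k-mers.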

Running kallisto with no arguments prints a list of usage options:

kallisto 0.43.0

Usage: kallisto <CMD> [arguments] ..

Where <CMD> can be one of:

    index         Builds a kallisto index
    quant         Runs the quantification algorithm
    pseudo        Runs the pseudoalignment step
    h5dump        Converts HDF5-formatted results to plaintext
    version       Prints version information
    cite          Prints citation information

Running kallisto <CMD> without arguments prints usage information for <CMD>

The commands are described below:

index:

kallisto index builds an index from a FASTA-formatted file of target sequences. The arguments for the index command are:

kallisto 0.43.0

Builds a kallisto index

Usage: kallisto index [arguments] FASTA-files

Required argument:

-i, --index=STRING          Filename for the kallisto index to be constructed

Optional argument:

-k, --kmer-size=INT         k-mer (odd) length (default: 31, max value: 31)
    --make-unique           Replace repeated target names with unique names

The input file must be in FASTA format and may be compressed.

quant:

kallisto quant runs the quantification algorithm. The arguments for the quant command are:

kallisto 0.43.0

Computes equivalence classes for reads and quantifies abundances

Usage: kallisto quant [arguments] FASTQ-files

Required arguments:

-i, --index=STRING            Filename for the kallisto index to be used for
                              quantification
-o, --output-dir=STRING       Directory to write output to

Optional arguments:

    --bias                    Perform sequence based bias correction
-b, --bootstrap-samples=INT   Number of bootstrap samples (default: 0)
    --seed=INT                Seed for the bootstrap sampling (default: 42)
    --plaintext               Output plaintext instead of HDF5
    --single                  Quantify single-end reads
    --fr-stranded             Strand specific reads, first read forward
    --rf-stranded             Strand specific reads, first read reverse
-l, --fragment-length=DOUBLE  Estimated average fragment length
-s, --sd=DOUBLE               Estimated standard deviation of fragment length
                              (default: value is estimated from the input data)
-t, --threads=INT             Number of threads to use (default: 1)
    --pseudobam               Output pseudoalignments in SAM format to stdout
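The usage text above describes quant as computing equivalence classes for reads and then quantifying abundances. As a rough illustration of that second step, the sketch below runs a plain EM iteration over equivalence-class counts. It is a toy under stated assumptions, not kallisto's implementation: the function name, the fixed iteration count, and the example counts are all made up.

```python
def em_abundances(eq_classes, eff_lengths, n_iter=200):
    """Estimate each target's share of reads from equivalence-class counts.

    eq_classes:  {frozenset of target names: number of reads in that class}
    eff_lengths: {target name: effective length}
    Returns {target name: estimated share of reads}.
    """
    targets = list(eff_lengths)
    alpha = {t: 1.0 / len(targets) for t in targets}  # uniform starting point
    total_reads = sum(eq_classes.values())
    for _ in range(n_iter):
        expected = {t: 0.0 for t in targets}
        # E-step: split each class's reads in proportion to alpha / eff_length
        for members, count in eq_classes.items():
            weights = {t: alpha[t] / eff_lengths[t] for t in members}
            norm = sum(weights.values())
            if norm == 0:
                continue  # class has no support under the current estimate
            for t, w in weights.items():
                expected[t] += count * w / norm
        # M-step: turn expected read counts back into shares
        alpha = {t: expected[t] / total_reads for t in targets}
    return alpha


# Hypothetical counts: 30 reads unique to t1, 10 unique to t2, 60 ambiguous
eq_classes = {
    frozenset({"t1"}): 30,
    frozenset({"t1", "t2"}): 60,
    frozenset({"t2"}): 10,
}
eff_lengths = {"t1": 1000.0, "t2": 1000.0}
alpha = em_abundances(eq_classes, eff_lengths)
print({t: round(share, 4) for t, share in alpha.items()})
```

With equal effective lengths, the ambiguous reads end up split in the same 3:1 ratio as the unique reads, so t1 receives about three quarters of all reads.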

kallisto can process either single-end or paired-end reads. The default mode is paired-end, with the FASTQ files listed as pairs:

kallisto quant -i index -o output pairA_1.fastq pairA_2.fastq pairB_1.fastq pairB_2.fastq

For single-end reads, use the --single option together with the -l and -s options, then list the FASTQ files:

kallisto quant -i index -o output --single -l 200 -s 20 file1.fastq.gz file2.fastq.gz file3.fastq.gz
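When many samples need to be quantified, it can help to assemble these command lines in a script. The sketch below only builds the argument list and does not run kallisto. It uses just the flags listed in the usage text above; the function name quant_command is hypothetical, and the check that single-end mode needs both -l and -s is an assumption based on kallisto 0.43 behaviour.

```python
import shlex


def quant_command(index, output_dir, fastq_files, single=False,
                  fragment_length=None, sd=None, threads=1):
    """Build the argument list for `kallisto quant` without running it."""
    if single:
        # Assumption: single-end mode needs both -l and -s
        if fragment_length is None or sd is None:
            raise ValueError("single-end mode needs fragment_length and sd")
    elif len(fastq_files) % 2 != 0:
        raise ValueError("paired-end mode needs an even number of FASTQ files")

    cmd = ["kallisto", "quant", "-i", index, "-o", output_dir]
    if threads != 1:
        cmd += ["-t", str(threads)]
    if single:
        cmd += ["--single", "-l", str(fragment_length), "-s", str(sd)]
    return cmd + list(fastq_files)


paired = quant_command("index", "output",
                       ["pairA_1.fastq", "pairA_2.fastq"])
single = quant_command("index", "output", ["file1.fastq.gz"],
                       single=True, fragment_length=200, sd=20)
print(shlex.join(paired))
print(shlex.join(single))
```

The returned list can be passed to subprocess.run(cmd, check=True) on a machine where kallisto is installed.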

kallisto quant produces three output files by default:

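abundance.h5: an HDF5 binary file containing run information, abundance estimates, bootstrap estimates, and so on. This file can be read by sleuth.
abundance.tsv: a plaintext file of the abundance estimates.
run_info.json: a JSON file containing information about the run.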
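As a hedged example of working with the plaintext output, the sketch below parses abundance.tsv with Python's csv module and ranks targets by tpm. The column names are those shown in the test output earlier; the function name read_abundance is made up, and the in-memory sample stands in for a real file.

```python
import csv
import io


def read_abundance(handle):
    """Parse abundance.tsv into a list of dicts with numeric columns converted."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        rows.append({
            "target_id": row["target_id"],
            "length": int(row["length"]),
            "eff_length": float(row["eff_length"]),
            "est_counts": float(row["est_counts"]),
            "tpm": float(row["tpm"]),
        })
    return rows


# Three rows from the test run shown earlier, tab-separated as in the file
sample = io.StringIO(
    "target_id\tlength\teff_length\test_counts\ttpm\n"
    "NM_001168316\t2283\t2105.9\t160.606\t12581\n"
    "NM_022658\t2288\t2110.9\t4881\t381446\n"
    "NR_003084\t1640\t1462.9\t0.00164208\t0.18517\n"
)
rows = read_abundance(sample)

# Rank targets from most to least abundant
top = sorted(rows, key=lambda r: r["tpm"], reverse=True)
for r in top:
    print(r["target_id"], r["tpm"])
```

For a real run, the handle would come from open("output/abundance.tsv", newline="") instead of the sample string.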

Notes on optional arguments:

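Pseudobam:
With --pseudobam, all pseudoalignments are written to stdout in SAM format. The output can be redirected to a file or converted to BAM with samtools.

For example: kallisto quant -i index -o out --pseudobam r1.fastq r2.fastq > out.sam

Or, with samtools: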

kallisto quant -i index -o out --pseudobam r1.fastq r2.fastq | samtools view -Sb - > out.bam




pseudo

kallisto pseudo runs only the pseudoalignment step and is intended for use with single-cell RNA-Seq data. The options for the pseudo command are:

kallisto 0.43.0

Computes equivalence classes for reads and quantifies abundances

Usage: kallisto pseudo [arguments] FASTQ-files

Required arguments:

-i, --index=STRING            Filename for the kallisto index to be used for
                              pseudoalignment
-o, --output-dir=STRING       Directory to write output to

Optional arguments:

-u  --umi                     First file in pair is a UMI file
-b  --batch=FILE              Process files listed in FILE
    --single                  Quantify single-end reads
-l, --fragment-length=DOUBLE  Estimated average fragment length
-s, --sd=DOUBLE               Estimated standard deviation of fragment length
                              (default: value is estimated from the input data)
-t, --threads=INT             Number of threads to use (default: 1)
    --pseudobam               Output pseudoalignments in SAM format to stdout

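The form of the command and the meaning of the parameters are the same as for the quant command. However, pseudo does not run the EM algorithm to quantify abundances. In addition, the pseudo command has an option to specify many cells in a batch file, for example: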

kallisto pseudo -i index -o output -b batch.txt
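The batch file passed with -b could be generated by a short script. The sketch below assumes the layout described in the kallisto manual, where each line holds a cell ID followed by its FASTQ files, separated by whitespace, with an optional comment header. That layout is an assumption here and should be checked against the manual for your kallisto version. The cell names, file names, and the function batch_lines are hypothetical.

```python
def batch_lines(cells):
    """Format one batch-file line per cell: an ID followed by its FASTQ files."""
    lines = ["#id file1 file2"]  # comment header (assumed to be ignored)
    for cell_id, files in cells.items():
        lines.append(" ".join([cell_id, *files]))
    return lines


# Hypothetical cells, each with a pair of FASTQ files
cells = {
    "cell1": ("cell1_1.fastq.gz", "cell1_2.fastq.gz"),
    "cell2": ("cell2_1.fastq.gz", "cell2_2.fastq.gz"),
}
text = "\n".join(batch_lines(cells)) + "\n"
print(text, end="")

# To write the file used in the command above:
# with open("batch.txt", "w") as fh:
#     fh.write(text)
```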

h5dump

kallisto h5dump converts HDF5-formatted results to plaintext. The arguments for the h5dump command are:

kallisto 0.43.0

Converts HDF5-formatted results to plaintext

Usage:  kallisto h5dump [arguments] abundance.h5

Required argument:

-o, --output-dir=STRING       Directory to write output to
