最近需要做转座子分析,查找发现可以使用 TransposonPSI 来进行分析。但是登陆官网,该软件 update 时间为 2013 年,但是因为时间紧迫,暂时还没有进行其他方法的调研,所以先选用该软件进行了分析。
一、TransposonPSI 安装及使用
1. TransposonPSI 安装
官网: http://transposonpsi.sourceforge.net
下载地址:https://sourceforge.net/projects/transposonpsi/
压缩包非常小,只有 10M 左右,解压后修改主角本 transposonPSI.pl 中三个软件的路径(blastall, formatdb, blastpgp),即可食用。
目录结构:
README
docs/
PerlLib/
scripts/
transposon_ORF_lib/
transposon_PSI_LIB/
misc/
transposonPSIcreate/
TransposonWeb/
transposonPSI.pl
test/
2. TransposonPSI 使用入门
直接进入 test 目录,执行 runMe.sh 即可进行测试,非常简单。查看 runMe.sh 发现,输入文件是我们需要进行分析的数据序列,nuc 表示核酸序列,prot 表示蛋白序列。
if [ -e target_test_genome_seq.fasta.gz ] && ! [ -e target_test_genome_seq.fasta ]
then
gunzip target_test_genome_seq.fasta.gz
fi ../transposonPSI.pl target_test_genome_seq.fasta nuc
runMe.sh
二、TransposonPSI 流程解读
对 transposonPSI.pl 进行 Linux 脚本复现
cd /Transposon/div_step/
if [ -d tmp ]
then
rm -rf tmp
fi mkdir tmp
cd tmp
ln -s ../target_test_genome_seq.fasta
/software/blast-2.2./bin/formatdb -i target_test_genome_seq.fasta -p F cd /Transposon/div_step/tmp TPSI_list=(
cacta
DDE_1
gypsy
hAT
helitron
ISa
ISb
isc1316
line
ltr_Roo
mariner_ant1
mariner
MuDR
P_element
piggybac
TY1_Copia
TyrRecombinaseCrypton
) for int in {..}
do
name=${TPSI_list[$int]}
/software/blast-2.2./bin/blastall -i /Transposon/TransposonPSI_08222010/transposon_PSI_LIB/$name.refSeq -d target_test_genome_seq.fasta -p psitblastn -R /Transposon/TransposonPSI_08222010/transposon_PSI_LIB/$name.chk -F F -M BLOSUM62 -t - -e 1e- -v -b >target_test_genome_seq.$name.psitblastn
/Transposon/TransposonPSI_08222010/scripts/BPbtab </Transposon/div_step/tmp/target_test_genome_seq.$name.psitblastn> /Transposon/div_step/tmp/target_test_genome_seq.$name.psitblastn.btab
done cat /Transposon/div_step/tmp/*btab | sort -rn -k13 >/Transposon/div_step/target_test_genome_seq.TPSI.allHits cd /Transposon/div_step/
perl /Transposon/TransposonPSI_08222010/scripts/TBLASTN_hit_chainer.pl target_test_genome_seq.TPSI.allHits btab >target_test_genome_seq.TPSI.allHits.chains
perl /Transposon/TransposonPSI_08222010/scripts/TPSI_btab_to_gff3.pl target_test_genome_seq.TPSI.allHits.chains >target_test_genome_seq.TPSI.allHits.chains.gff3
perl /Transposon/TransposonPSI_08222010/scripts/TBLASTN_hit_chainer_nonoverlapping_genome_DP_extraction.pl target_test_genome_seq.TPSI.allHits.chains >target_test_genome_seq.TPSI.allHits.chains.bestPerLocus
perl /Transposon/TransposonPSI_08222010/scripts/TPSI_chains_to_gff3.pl target_test_genome_seq.TPSI.allHits.chains.bestPerLocus >target_test_genome_seq.TPSI.allHits.chains.bestPerLocus.gff3
work.sh
1. 格式化序列数据库
这是 blast 比对的首要步骤,这里我就不多介绍了,详细的参数和使用说明很多大佬都有介绍,使用时百度即可。
/software/blast-2.2.26/bin/formatdb -i target_test_genome_seq.fasta -p F
2. 数据库列表准备
TPSI_list=(
cacta
DDE_1
gypsy
hAT
helitron
ISa
ISb
isc1316
line
ltr_Roo
mariner_ant1
mariner
MuDR
P_element
piggybac
TY1_Copia
TyrRecombinaseCrypton
)
TPSI_list
以上列表为各类转座子名称,它们存在于 transposon_PSI_LIB/ 目录中,每一种数据库有三种格式:refSeq,chk,chkp
3. 序列与各数据库进行比对
/software/blast-2.2.26/bin/blastall \
-i /Transposon/TransposonPSI_08222010/transposon_PSI_LIB/$name.refSeq \
-d target_test_genome_seq.fasta \
-p psitblastn -R /Transposon/TransposonPSI_08222010/transposon_PSI_LIB/$name.chk \
-F F \
-M BLOSUM62 \
-t -1 \
-e 1e-5 \
-v 10000 \
-b 10000 \
>target_test_genome_seq.$name.psitblastn
特殊参数
-R PSI-TBLASTN checkpoint file [File In] Optional
得到比对结果。
4. BPbtab
/Transposon/TransposonPSI_08222010/scripts/BPbtab \
</Transposon/div_step/tmp/target_test_genome_seq.$name.psitblastn> \
/Transposon/div_step/tmp/target_test_genome_seq.$name.psitblastn.btab
将比对结果转化为 btab 格式:
cacta PSITBLASTN target_test_genome_seq.fasta genome 81.8 81.8 54.5 Plus 1e-
cacta PSITBLASTN target_test_genome_seq.fasta genome 68.3 78.0 52.9 Plus 4e-
cacta PSITBLASTN target_test_genome_seq.fasta genome 89.7 89.7 52.2 Plus 7e-
cacta PSITBLASTN target_test_genome_seq.fasta genome 75.8 75.8 49.8 Plus 3e-
cacta PSITBLASTN target_test_genome_seq.fasta genome 68.4 71.1 49.8 Plus 3e-
cacta PSITBLASTN target_test_genome_seq.fasta genome 29.7 39.8 49.5 Plus 4e-
cacta PSITBLASTN target_test_genome_seq.fasta genome 70.3 73.0 49.1 Plus 6e-
cacta PSITBLASTN target_test_genome_seq.fasta genome 76.5 79.4 48.7 Plus 7e-
cacta PSITBLASTN target_test_genome_seq.fasta genome 33.3 46.5 48.7 Plus 7e-
cacta PSITBLASTN target_test_genome_seq.fasta genome 55.6 64.4 48.3 Plus 1e-
cacta PSITBLASTN target_test_genome_seq.fasta genome 73.5 73.5 47.2 Plus 2e-
cacta PSITBLASTN target_test_genome_seq.fasta genome 58.5 60.4 47.2 Plus 2e-
cacta PSITBLASTN target_test_genome_seq.fasta genome 62.8 62.8 46.0 Plus 5e-
cacta PSITBLASTN target_test_genome_seq.fasta genome 61.9 61.9 44.1 Plus 2e-
cacta PSITBLASTN target_test_genome_seq.fasta genome 62.5 62.5 43.7 Plus 3e-
target_test_genome_seq.cacta.psitblastn.btab
BPtab 是一个Blast输出解析器, 脚本将 WU-BLAST 或 NCBI-BLAST 输出文件解析为BTAB格式,其中每个 HSP 报告为带有制表符分隔字段的单行。
5. 统计比对结果
cat /Transposon/div_step/tmp/*btab | sort -rn -k13 >/Transposon/div_step/target_test_genome_seq.TPSI.allHits
TY1_Copia PSITBLASTN target_test_genome_seq.fasta genome 27.2 46.4 Plus
helitronORF PSITBLASTN target_test_genome_seq.fasta genome 53.4 64.2 Plus
Crypton PSITBLASTN target_test_genome_seq.fasta genome 40.7 57.7 Plus 1.18E-139
TY1_Copia PSITBLASTN target_test_genome_seq.fasta genome 35.3 56.2 Plus 1.00E-129
helitronORF PSITBLASTN target_test_genome_seq.fasta genome Plus 1.00E-125
gypsy PSITBLASTN target_test_genome_seq.fasta genome 26.7 46.7 Plus 1.18E-109
cacta PSITBLASTN target_test_genome_seq.fasta genome 81.8 81.8 54.5 Plus 1.18E-04
target_test_genome_seq.TPSI.allHits
BTAB 格式的具体内容并为完全掌握,暂时不提。
6. 共线性 HSPs 关联
perl /Transposon/TransposonPSI_08222010/scripts/TBLASTN_hit_chainer.pl \
target_test_genome_seq.TPSI.allHits btab \
>target_test_genome_seq.TPSI.allHits.chains
collinear HSPs are chained together into larger chains (more complete element regions).
将共线性的 HSP 连接在一起,形成 larger chains,例如下面的文件,会将线性相关的放在一起
#Chain Crypton - genome - + 46.3
Crypton PSITBLASTN target_test_genome_seq.fasta genome 32.6 47.9 85.5 #Chain Crypton - genome - + 92.7
Crypton PSITBLASTN target_test_genome_seq.fasta genome 42.5 57.5 66.2
Crypton PSITBLASTN target_test_genome_seq.fasta genome 33.5 50.0 #Chain Crypton - genome - + 148.0
Crypton PSITBLASTN target_test_genome_seq.fasta genome 37.9 55.7
Crypton PSITBLASTN target_test_genome_seq.fasta genome 34.8 48.2
target_test_genome_seq.TPSI.allHits.chains
7. 将 larger chains 转 gff3 格式
perl /Transposon/TransposonPSI_08222010/scripts/TPSI_btab_to_gff3.pl \
target_test_genome_seq.TPSI.allHits.chains \
>target_test_genome_seq.TPSI.allHits.chains.gff3
8. best chains :提取分数最高的 chains
perl /Transposon/TransposonPSI_08222010/scripts/TBLASTN_hit_chainer_nonoverlapping_genome_DP_extraction.pl \
target_test_genome_seq.TPSI.allHits.chains \
>target_test_genome_seq.TPSI.allHits.chains.bestPerLocus
9. best chains 转 gff3 格式
perl /Transposon/TransposonPSI_08222010/scripts/TPSI_chains_to_gff3.pl \
target_test_genome_seq.TPSI.allHits.chains.bestPerLocus \
>target_test_genome_seq.TPSI.allHits.chains.bestPerLocus.gff3
即为最终结果。