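Genome annotation
Train AUGUSTUS first, then run MAKER for the gene annotation.
AUGUSTUS ships with a set of pre-trained species parameters. If one of them is for a species very close to yours, use it directly; otherwise train your own.
Website: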
http://bioinf.uni-greifswald.de/augustus/
Older versions:
http://bioinf.uni-greifswald.de/augustus/binaries/old/
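In the end I downloaded version 2.7. I could not get the newer 3.2 to install; it was too much trouble.
After downloading, unpack the archive, cd src, run sudo make, then add the two AUGUSTUS_CONFIG_PATH lines below to ~/.bash_profile, reload it, and copy the binary into /usr/local/bin: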
vi ~/.bash_profile
AUGUSTUS_CONFIG_PATH=/home/cmiao/augustus.2.7/config/
export AUGUSTUS_CONFIG_PATH
source ~/.bash_profile
sudo cp /home/cmiao/augustus.2.7/bin/augustus /usr/local/bin/
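A quick sanity check of the setup above can save a failed run later. The following is only a minimal sketch, not part of AUGUSTUS: the function names check_setup and list_trained_species are made up for illustration, and it assumes the pre-trained parameter sets live in one subdirectory each under $AUGUSTUS_CONFIG_PATH/species.

```python
import os
import shutil
from pathlib import Path


def list_trained_species(config_path):
    """Return the sorted subdirectory names found in <config_path>/species."""
    species_dir = Path(config_path) / "species"
    if not species_dir.is_dir():
        return []
    return sorted(p.name for p in species_dir.iterdir() if p.is_dir())


def check_setup(environ=os.environ, which=shutil.which):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    config = environ.get("AUGUSTUS_CONFIG_PATH")
    if not config:
        problems.append("AUGUSTUS_CONFIG_PATH is not set")
    elif not (Path(config) / "species").is_dir():
        problems.append(f"no species directory under {config}")
    if which("augustus") is None:
        problems.append("augustus is not on PATH")
    return problems


# Report problems, then list the available parameter sets (if any).
for problem in check_setup():
    print("PROBLEM:", problem)
config = os.environ.get("AUGUSTUS_CONFIG_PATH")
if config:
    print("\n".join(list_trained_species(config)))
```

If a close relative of your species shows up in the list, you can skip training altogether.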
Install pslCDnaFilter if you do not have it yet. Without it you get this warning:
WARNING: Could not successfully find and run pslCDnaFilter. Please install this program.
Install pslCDnaFilter and try again.
Download:
http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64.v287/pslCDnaFilter
Many other tools can be downloaded here as well:
Input files:
Reference genome
cDNA
Once both are ready, run:
~/augustus.2.7/scripts/autoAug.pl --species=Carya --genome=../Carya.fa --cdna=../Carya_400cDNA.fa --singleCPU
Error:
1 ####### Step 1: Training AUGUSTUS (no UTR models) #######
Error: missing training file!
Cause: if you have no GFF file to train from, you must add --pasa.
PASA, acronym for Program to Assemble Spliced Alignments, is a eukaryotic genome annotation tool that exploits spliced alignments of expressed transcript sequences to automatically model gene structures, and to maintain gene structure annotation consistent with the most recently available experimental sequence data. PASA also identifies and classifies all splicing variations supported by the transcript alignments.
For installing PASA, see the PASA installation blog post.
Once it is installed, run:
~/augustus.2.7/scripts/autoAug.pl --species=Carya --genome=../Carya.fa --cdna=../Carya_400cDNA.fa --singleCPU --pasa
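If a closely related species has a well-assembled, well-annotated genome with a GFF, you can train on that species instead. For example, I work on walnut and chose peach: I downloaded its genome and GFF from Phytozome for training.
You can also run the training online.
Online training site: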
http://bioinf.uni-greifswald.de/webaugustus/training/create
You have to give a species name (no spaces allowed!) and a genome file.
Header requirements for the reference genome and cDNA FASTA files:
- no whitespaces in the headers
- no special characters in the headers (e.g. !#@&|;)
- make the headers as short as possible
- let headers not start with a number but with a letter
- let headers contain letters and numbers, only
In the following we give some header examples that will not cause problems:
>entry1
>contig1000
>est20
>scaffold239
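The header rules above are easy to check before uploading. The following is a minimal sketch, not a tool from AUGUSTUS: the name bad_headers is made up for illustration, and it checks only the mechanical rules (starts with a letter, letters and digits only), not the "as short as possible" advice.

```python
import re

# A header passes if it starts with a letter and contains only
# letters and digits after that (so no whitespace, no special characters).
HEADER_OK = re.compile(r"[A-Za-z][A-Za-z0-9]*")


def bad_headers(fasta_lines):
    """Yield (line_number, header) for every header that breaks the rules."""
    for n, line in enumerate(fasta_lines, start=1):
        if line.startswith(">"):
            header = line[1:].rstrip("\r\n")
            if not HEADER_OK.fullmatch(header):
                yield n, header


# Usage on a real file (the file name is a placeholder):
# with open("your_fa_file") as handle:
#     for n, header in bad_headers(handle):
#         print(f"line {n}: {header!r}")
```

An empty result means every header already follows the rules; anything reported needs renaming before submission.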
Detailed online training tutorial:
http://bioinf.uni-greifswald.de/webaugustus/trainingtutorial.gsp
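For online training, neither the genome nor the cDNA file may exceed 100M. If yours are larger, pick the longer sequences from the reference and from the cDNA so that the total size stays under 100M.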
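Picking the longest sequences under a size limit could be sketched as follows. This is only an illustration under assumptions: the names read_fasta and longest_within_budget are hypothetical, the budget counts sequence characters rather than bytes on disk (headers and newlines add a little), so leave some margin below the real limit.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def longest_within_budget(records, budget):
    """Keep the longest sequences, longest first, until the budget is hit."""
    kept, total = [], 0
    for header, seq in sorted(records, key=lambda r: len(r[1]), reverse=True):
        if total + len(seq) > budget:
            break  # stop at the first sequence that no longer fits
        kept.append((header, seq))
        total += len(seq)
    return kept


# Usage on a real file (file name and budget are placeholders):
# with open("genome.fa") as handle:
#     subset = longest_within_budget(read_fasta(handle), 95_000_000)
```

Stopping at the first sequence that does not fit keeps the result a clean "top N longest" set; skipping it and continuing with shorter ones would fill the budget more tightly if that matters.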
Error:
Failed to execute, possible reasons could be: 1. There is already a database named "PASAtrainBKY7KMFm" in your mysql host. 2. The software "slclust" is not installed correctly, try to install it again (see the details in the PASA documentation). 3. The fasta headers in cDNA or genome file were not unique. Inspect /data/www/augtrain/webdata/trainBKY7KMFm/autoAug/trainingSet/pasa/Launch_PASA_pipeline.stderr for PASA error messages.
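After checking the files, it turned out that the cDNA file had duplicate headers, and the sequences under the duplicated names were not the same. I wrote a script to fix it: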
python /share/Public/off_zhangliangsheng/checkHeaderEditName.py your_fa_file
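The script itself is not shown here, so the following is only a guess at what such a fix could look like, not the actual checkHeaderEditName.py. It is a minimal sketch with a hypothetical function name, uniquify_headers, that keeps the first occurrence of each header and appends dup2, dup3, ... to later ones, using letters and digits only so the new names still satisfy the header rules.

```python
def uniquify_headers(lines):
    """Return FASTA lines with repeated headers renamed to be unique."""
    seen = set()    # every header already written out
    counts = {}     # how many copies of each original name were renamed
    out = []
    for line in lines:
        if line.startswith(">"):
            name = line[1:].strip()
            new = name
            # Keep bumping the suffix until the name is really unused.
            while new in seen:
                counts[name] = counts.get(name, 1) + 1
                new = f"{name}dup{counts[name]}"
            seen.add(new)
            out.append(">" + new + "\n")
        else:
            out.append(line)  # sequence lines pass through unchanged
    return out


# Usage on a real file (file names are placeholders):
# with open("your_fa_file") as src, open("your_fa_file.unique", "w") as dst:
#     dst.writelines(uniquify_headers(src))
```

Renaming rather than dropping duplicates fits this case, since the sequences under the repeated names differ and all of them are worth keeping.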
Then submit the job again.