问题描述
我有一个很大的文件变体表( variation.txt ). 染色体编号中的第一列,变异的位置.我有第二个文件 annotation.txt ,其中包含37,000个基因(第一列),其染色体编号(第二列),开始和结束坐标(第三列),然后是一些详细信息
I have a variation table (variation.txt) which is a very big file. The first column in the chromosome number and the second column is the position of the variation. I have a second file annotation.txt which has a list of 37,000 genes (1st column), their chromosome number(2nd column), their start and end coordinates (3rd column), followed by some details
我必须将变异(基于染色体数目及其位置)分配给基因.首先,它应该在两个文件中查找匹配的染色体编号,如果匹配,则变异的坐标应在基因的起始和终止位置内(包括).我在python中尝试过它,但是它需要很长时间.此外,我希望具有如下所示的修改后的输出.基因可以具有重叠的坐标,并且给定的变异可以是多个重叠基因的一部分.请帮助.
I have to assign the variations (based on chromosome number and its position) to the genes. First, it should look for the matching chromosome number in both files, and if that matches, the coordinate of the variation should be within (including) start and end position of the gene. I have attempted it in python but its taking a long time. Moreover, I want to have a modified output as shown below. Genes can have overlapping coordinates and a given variation can be part of multiple overlapping genes. Kindly help.
variation.txt
SL3.0ch02 702679 C A - - - - - - - -
SL3.0ch01 711131 A G - - - - - - - -
SL3.0ch00 715124 G A - - - - - - - -
SL3.0ch00 719289 C T - - - - - - - -
SL3.0ch00 720926 A C - - - - - - - -
SL3.0ch00 723860 A C Solyc00g005060.1 CDS NONSYNONYMOUS W/G 52 0 novel DELETERIOUS (*WARNING! Low confidence)
SL3.0ch00 723867 A C Solyc00g005060.1 CDS SYNONYMOUS G/G 49 1 novel TOLERATED
SL3.0ch00 723903 T C Solyc00g005060.1 CDS SYNONYMOUS G/G 37 1 novel TOLERATED
annotation.txt
Solyc00g005000.3.1 SL3.0ch02 702600 702900 + Eukaryotic aspartyl protease family protein
Solyc00g005040.3.1 SL3.0ch01 715100 715200 + Potassium channel
Solyc00g005050.3.1 SL3.0ch00 715150 715300 - UPF0664 stress-induced protein C29B12.11c
Solyc00g005060.1.1 SL3.0ch00 723741 724013 - LOW QUALITY:Cyclin/Brf1-like TBP-binding protein
Solyc00g005080.2.1 SL3.0ch00 723800 723900 - LOW QUALITY:Protein Ycf2
Solyc00g005084.1.1 SL3.0ch05 809593 813633 + UDP-Glycosyltransferase superfamily protein
Solyc00g005090.1.1 SL3.0ch07 1061632 1061916 - LOW QUALITY:DYNAMIN-like 1B
Solyc00g005092.1.1 SL3.0ch01 1127794 1144385 + Serine/threonine phosphatase-like protein
Solyc00g005094.1.1 SL3.0ch00 1144958 1146952 - Glucose-6-phosphate 1-dehydrogenase 3, chloroplastic
Solyc00g005096.1.1 SL3.0ch00 1734562 1736567 + RWP-RK domain-containing protein
所需的输出:
SL3.0ch02 702679 C A - - - - - - - - Solyc00g005000.3.1
SL3.0ch00 715124 G A - - - - - - - - Solyc00g005040.3.1
SL3.0ch00 723860 A C Solyc00g005060.1 CDS NONSYNONYMOUS W/G 52 0 novel DELETERIOUS (*WARNING! Low confidence) Solyc00g005060.1.1
SL3.0ch00 723860 A C Solyc00g005060.1 CDS NONSYNONYMOUS W/G 52 0 novel DELETERIOUS (*WARNING! Low confidence) Solyc00g005080.2.1
SL3.0ch00 723867 A C Solyc00g005060.1 CDS SYNONYMOUS G/G 49 1 novel TOLERATED Solyc00g005060.1.1
SL3.0ch00 723867 A C Solyc00g005060.1 CDS SYNONYMOUS G/G 49 1 novel TOLERATED Solyc00g005080.2.1
SL3.0ch00 723903 T C Solyc00g005060.1 CDS SYNONYMOUS G/G 37 1 novel TOLERATED Solyc00g005060.1.1
代码:
import re
file1 = open("variation", "r")
file2 = open("annotation.txt", "r")
probe_id = file1.read().splitlines()
loc_id = file2.read().splitlines()
for i in probe_id:
i=i.rstrip()
probe_info=i.split('\t')
probe_info[1]=probe_info[1].strip()
probe_info[0]=probe_info[0].strip()
#print probe_info[1]
gene_list=[]
for j in loc_id:
loc_info=j.split('\t')
loc_info[2]=loc_info[2].strip()
loc_info[3]=loc_info[3].strip()
if loc_info[1]==probe_info[0]:
if (int(probe_info[1]) >= int(loc_info[2])):
if (int(probe_info[1]) <=int(loc_info[3])):
gene_list.append(loc_info[0])
if len(gene_list)!=0:
print i+"\t"+str(gene_list)
当前输出:
SL3.0ch02 702679 C A - - - - - - - - ['Solyc00g005000.3.1']
SL3.0ch00 715124 G A - - - - - - - - ['Solyc00g005040.3.1']
SL3.0ch00 723860 A C Solyc00g005060.1 CDS NONSYNONYMOUS W/G 52 0 novel DELETERIOUS (*WARNING! Low confidence) ['Solyc00g005060.1.1', 'Solyc00g005080.2.1']
SL3.0ch00 723867 A C Solyc00g005060.1 CDS SYNONYMOUS G/G 49 1 novel TOLERATED ['Solyc00g005060.1.1', 'Solyc00g005080.2.1']
SL3.0ch00 723903 T C Solyc00g005060.1 CDS SYNONYMOUS G/G 37 1 novel TOLERATED ['Solyc00g005060.1.1']
推荐答案
这是GNU awk的开始,它与染色体编号和范围内的位置匹配:
This is a start for GNU awk that matches the chromosome numbers and the position in the range:
$ awk '
NR==FNR {
a[$2][$3 " " $4]=$0 # store the annotations
next
}
($1 in a){ # if chromosome found
for(i in a[$1]) # process all the ranges
if(split(i,t)&&$2>=t[1]&&$2<=t[2]) # if there is a match
print # output
}' anno vari
输出atm:
SL3.0ch02 702679 C A - - - - - - - -
SL3.0ch00 723860 A C Solyc00g005060.1 CDS NONSYNONYMOUS W/G 52 0 novel DELETERIOUS (*WARNING! Low confidence)
SL3.0ch00 723860 A C Solyc00g005060.1 CDS NONSYNONYMOUS W/G 52 0 novel DELETERIOUS (*WARNING! Low confidence)
SL3.0ch00 723867 A C Solyc00g005060.1 CDS SYNONYMOUS G/G 49 1 novel TOLERATED
SL3.0ch00 723867 A C Solyc00g005060.1 CDS SYNONYMOUS G/G 49 1 novel TOLERATED
SL3.0ch00 723903 T C Solyc00g005060.1 CDS SYNONYMOUS G/G 37 1 novel TOLERATED
这篇关于用于处理(基因)坐标文件的脚本的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!