用于处理(基因)坐标文件的脚本

本文介绍了用于处理(基因)坐标文件的脚本的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我有一个很大的文件变体表( variation.txt ). 染色体编号中的第一列，变异的位置.我有第二个文件 annotation.txt ，其中包含37,000个基因(第一列)，其染色体编号(第二列)，开始和结束坐标(第三列)，然后是一些详细信息

I have a variation table (variation.txt) which is a very big file. The first column in the chromosome number and the second column is the position of the variation. I have a second file annotation.txt which has a list of 37,000 genes (1st column), their chromosome number(2nd column), their start and end coordinates (3rd column), followed by some details

我必须将变异(基于染色体数目及其位置)分配给基因.首先，它应该在两个文件中查找匹配的染色体编号，如果匹配，则变异的坐标应在基因的起始和终止位置内(包括).我在python中尝试过它，但是它需要很长时间.此外，我希望具有如下所示的修改后的输出.基因可以具有重叠的坐标，并且给定的变异可以是多个重叠基因的一部分.请帮助.

I have to assign the variations (based on chromosome number and its position) to the genes. First, it should look for the matching chromosome number in both files, and if that matches, the coordinate of the variation should be within (including) start and end position of the gene. I have attempted it in python but its taking a long time. Moreover, I want to have a modified output as shown below. Genes can have overlapping coordinates and a given variation can be part of multiple overlapping genes. Kindly help.

variation.txt

SL3.0ch02   702679  C   A   -   -   -   -   -   -   -   -
SL3.0ch01   711131  A   G   -   -   -   -   -   -   -   -
SL3.0ch00   715124  G   A   -   -   -   -   -   -   -   -
SL3.0ch00   719289  C   T   -   -   -   -   -   -   -   -
SL3.0ch00   720926  A   C   -   -   -   -   -   -   -   -
SL3.0ch00   723860  A   C   Solyc00g005060.1    CDS     NONSYNONYMOUS   W/G     52  0   novel   DELETERIOUS (*WARNING! Low confidence)
SL3.0ch00   723867  A   C   Solyc00g005060.1    CDS     SYNONYMOUS  G/G     49  1   novel   TOLERATED
SL3.0ch00   723903  T   C   Solyc00g005060.1    CDS     SYNONYMOUS  G/G     37  1   novel   TOLERATED

annotation.txt

Solyc00g005000.3.1  SL3.0ch02   702600  702900  +   Eukaryotic aspartyl protease family protein
Solyc00g005040.3.1  SL3.0ch01   715100  715200  +   Potassium channel
Solyc00g005050.3.1  SL3.0ch00   715150  715300  -   UPF0664 stress-induced protein C29B12.11c
Solyc00g005060.1.1  SL3.0ch00   723741  724013  -   LOW QUALITY:Cyclin/Brf1-like TBP-binding protein
Solyc00g005080.2.1  SL3.0ch00   723800  723900  -   LOW QUALITY:Protein Ycf2
Solyc00g005084.1.1  SL3.0ch05   809593  813633  +   UDP-Glycosyltransferase superfamily protein
Solyc00g005090.1.1  SL3.0ch07   1061632 1061916 -   LOW QUALITY:DYNAMIN-like 1B
Solyc00g005092.1.1  SL3.0ch01   1127794 1144385 +   Serine/threonine phosphatase-like protein
Solyc00g005094.1.1  SL3.0ch00   1144958 1146952 -   Glucose-6-phosphate 1-dehydrogenase 3, chloroplastic
Solyc00g005096.1.1  SL3.0ch00   1734562 1736567 +   RWP-RK domain-containing protein

所需的输出:

SL3.0ch02   702679  C   A   -   -   -   -   -   -   -   -   Solyc00g005000.3.1
SL3.0ch00   715124  G   A   -   -   -   -   -   -   -   -   Solyc00g005040.3.1
SL3.0ch00   723860  A   C   Solyc00g005060.1    CDS NONSYNONYMOUS   W/G 52  0   novel   DELETERIOUS (*WARNING! Low confidence)  Solyc00g005060.1.1
SL3.0ch00   723860  A   C   Solyc00g005060.1    CDS NONSYNONYMOUS   W/G 52  0   novel   DELETERIOUS (*WARNING! Low confidence)  Solyc00g005080.2.1
SL3.0ch00   723867  A   C   Solyc00g005060.1    CDS SYNONYMOUS  G/G 49  1   novel   TOLERATED   Solyc00g005060.1.1
SL3.0ch00   723867  A   C   Solyc00g005060.1    CDS SYNONYMOUS  G/G 49  1   novel   TOLERATED   Solyc00g005080.2.1
SL3.0ch00   723903  T   C   Solyc00g005060.1    CDS SYNONYMOUS  G/G 37  1   novel   TOLERATED   Solyc00g005060.1.1

代码:

import re
file1 = open("variation", "r")
file2 = open("annotation.txt", "r")
probe_id = file1.read().splitlines()
loc_id = file2.read().splitlines()

for i in probe_id:
    i=i.rstrip()
    probe_info=i.split('\t')
    probe_info[1]=probe_info[1].strip()
    probe_info[0]=probe_info[0].strip()
    #print probe_info[1]
    gene_list=[]
    for j in loc_id:
        loc_info=j.split('\t')
        loc_info[2]=loc_info[2].strip()
        loc_info[3]=loc_info[3].strip()
        if loc_info[1]==probe_info[0]:
            if (int(probe_info[1]) >= int(loc_info[2])):
                 if (int(probe_info[1]) <=int(loc_info[3])):
                    gene_list.append(loc_info[0])
    if len(gene_list)!=0:
        print i+"\t"+str(gene_list)

当前输出:

SL3.0ch02   702679  C   A   -   -   -   -   -   -   -   -   ['Solyc00g005000.3.1']
SL3.0ch00   715124  G   A   -   -   -   -   -   -   -   -   ['Solyc00g005040.3.1']
SL3.0ch00   723860  A   C   Solyc00g005060.1    CDS NONSYNONYMOUS   W/G 52  0   novel   DELETERIOUS (*WARNING! Low confidence)  ['Solyc00g005060.1.1', 'Solyc00g005080.2.1']
SL3.0ch00   723867  A   C   Solyc00g005060.1    CDS SYNONYMOUS  G/G 49  1   novel   TOLERATED   ['Solyc00g005060.1.1', 'Solyc00g005080.2.1']
SL3.0ch00   723903  T   C   Solyc00g005060.1    CDS SYNONYMOUS  G/G 37  1   novel   TOLERATED   ['Solyc00g005060.1.1']

file

用于处理(基因)坐标文件的脚本

问题描述

推荐答案