Reading a FASTA file in Python

Problem description

I am reading a FASTA file that has a format like this:


>gi|31563518|ref|NP_852610.1| microtubule-associated proteins 1A/1B light chain 3A isoform b [Homo sapiens]
MKMRFFSSPCGKAAVDPADRCKEVQQIRDQHPSKIPVIIERYKGEKQLPVLDKTKFLVPDHVNMSELVKIIRRRLQLNPTQAFFLLVNQHSMVSVSTPIADIYEQEKDEDGFLYMVYASQETFGF

I have to read the file and then calculate the JC (Jukes-Cantor) distance. For a pair of sequences, the JC distance is -3/4 * ln(1 - 4/3 * p), where p is the proportion of sites that differ between the pair.

I have set up the skeleton of it but am unsure how to do the rest. After reading the file and calculating the Jukes-Cantor distances, I have to write them to a new output file, and the output should be in a table. Any help I can get is much appreciated! Thanks. I am new to Python AND to FASTA files.

def readData():
    filename = input("Enter the name of the FASTA file: ")
    infile = open(filename, "r")

def CalculateJC(x,y):
    if x == y:
        return 0
    else:
        return 1 # temporary*

def calcDists(seqs):
    output = []
    for seq1 in seqs:
        newrow = []
        for seq2 in seqs:
            dist = CalculateJC(seq1, seq2)
            newrow.append(dist)
        output.append(newrow)
    return output


def outputDists(distMat):
    pass

def main():
    seqs = readData()
    distMat = calcDists(seqs)
    outputDists(distMat)



if __name__ == "__main__":
    main()

Recommended answer

You are asking too many questions at a time! Focus on one.

Reading and writing FASTA files is covered by Biopython (as suggested in the comments).
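If installing Biopython is not an option, the file format is simple enough to parse by hand. The following is only a minimal sketch using the standard library; the function name read_fasta is hypothetical, and it assumes a well-formed file in which every record starts with a ">" header line followed by one or more sequence lines.

```python
def read_fasta(path):
    """Return a list of (header, sequence) tuples read from a FASTA file."""
    records = []
    header = None
    chunks = []
    with open(path, "r") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines between records
            if line.startswith(">"):
                # A new header closes the previous record, if there was one.
                if header is not None:
                    records.append((header, "".join(chunks)))
                header = line[1:]
                chunks = []
            elif header is not None:
                # Sequence data may be wrapped over several lines.
                chunks.append(line)
    # Store the last record in the file.
    if header is not None:
        records.append((header, "".join(chunks)))
    return records
```

Calling read_fasta("proteins.fasta") (a hypothetical file name) would give one tuple per record, with wrapped sequence lines joined into a single string.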

I noticed that you aren't calculating your JC distance yet, so perhaps this is where you need help. Here is what I came up with:

import math

def computeJC(seq1, seq2):
    # Count the aligned positions at which the two sequences differ.
    differing = 0
    for base1, base2 in zip(seq1, seq2):
        differing += (base1 != base2)
    # p is the proportion of sites that differ between the pair.
    p = differing / float(len(seq1))
    return -3.0 / 4.0 * math.log(1 - 4.0 / 3.0 * p)

The zip trick is explained here: How can I iterate through two lists in parallel. (On Python 2, itertools.izip plays the same role.) This piece of code will work with any kind of string, and the loop will stop when either seq1 or seq2 reaches the end.
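To get from a single pairwise distance to the table the question asks for, the same calculation can be applied to every pair and the result formatted as tab-separated text. This is a rough sketch rather than a definitive solution: the names jc_distance, distance_matrix and format_table are hypothetical, the sequences are assumed to be aligned and of equal length, and returning infinity when p reaches 3/4 (where the logarithm is undefined) is a choice made here for illustration.

```python
import math

def jc_distance(seq1, seq2):
    """Jukes-Cantor distance between two aligned, equal-length sequences."""
    differing = sum(a != b for a, b in zip(seq1, seq2))
    if differing == 0:
        return 0.0  # identical sequences
    p = differing / len(seq1)
    if p >= 0.75:
        return float("inf")  # assumption: the formula is undefined here
    return -0.75 * math.log(1 - 4.0 / 3.0 * p)

def distance_matrix(seqs):
    """Square matrix of pairwise distances, in the order of seqs."""
    return [[jc_distance(s1, s2) for s2 in seqs] for s1 in seqs]

def format_table(names, matrix):
    """Tab-separated table with a header row and one labelled row per sequence."""
    lines = ["\t".join([""] + list(names))]
    for name, row in zip(names, matrix):
        lines.append("\t".join([name] + ["%.4f" % d for d in row]))
    return "\n".join(lines) + "\n"
```

The string returned by format_table can then be written out with an ordinary open(filename, "w") call.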

Someone else may come up with a "Pythonic one-liner", but try to understand my approach first. It avoids the pitfalls that your code fell into: nested loops, unnecessary branching, runtime list growing, and spaghetti code, to name a few. Enjoy!
