Problem description
I am reading a FASTA file that has a format like this:
>gi|31563518|ref|NP_852610.1| microtubule-associated proteins 1A/1B light chain 3A isoform b [Homo sapiens]
MKMRFFSSPCGKAAVDPADRCKEVQQIRDQHPSKIPVIIERYKGEKQLPVLDKTKFLVPDHVNMSELVKIIRRRLQLNPTQAFFLLVNQHSMVSVSTPIADIYEQEKDEDGFLYMVYASQETFGF
I have to read the file and then calculate the JC distance (For a pair of sequences, the JC distance is -3/4 * ln(1 - 4/3 * p), where p is the proportion of sites that differ between the pair)
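To make the formula concrete, here is a minimal sketch that evaluates it for a given p. The function name `jc_from_p` is hypothetical, and returning infinity when p reaches 3/4 is an assumption made for illustration: at that point the argument of the logarithm is zero or negative, so the distance is undefined.

```python
import math

def jc_from_p(p):
    """Jukes-Cantor distance for a proportion p of differing sites."""
    # Assumption: report saturated pairs (p >= 3/4) as infinitely distant,
    # because the logarithm is undefined there.
    if p >= 0.75:
        return float("inf")
    return -0.75 * math.log(1 - 4.0 / 3.0 * p)

# 10% of the sites differ, so the corrected distance is a bit above 0.1
d = jc_from_p(0.1)
print(round(d, 4))
```

The distance is always slightly larger than p itself, because the correction accounts for sites that may have changed more than once.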
I have set up the skeleton of it but am unsure how to do the rest. After reading the file and calculating the Jukes-Cantor distance, I have to write the result to a new output file, and it should be in a table. Any help I can get is much appreciated! Thanks, I am new to both Python and FASTA files.
def readData():
    filename = input("Enter the name of the FASTA file: ")
    infile = open(filename, "r")
    # TODO: parse the records and return the sequences

def CalculateJC(x, y):
    if x == y:
        return 0
    else:
        return 1  # temporary*

def calcDists(seqs):
    output = []
    for seq1 in seqs:
        newrow = []
        for seq2 in seqs:
            dist = CalculateJC(seq1, seq2)
            newrow.append(dist)
        output.append(newrow)
    return output

def outputDists(distMat):
    pass

def main():
    seqs = readData()
    distMat = calcDists(seqs)
    outputDists(distMat)

if __name__ == "__main__":
    main()
Recommended answer
You are asking too many questions at a time! Focus on one.
Reading and writing FASTA files is covered in BioPython (as suggested in comments).
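If BioPython is not available, the reading step can also be sketched with the standard library alone. This is not BioPython's parser, just a minimal illustration; the function name `read_fasta` is hypothetical, and it assumes every record starts with a `>` header line followed by one or more sequence lines.

```python
def read_fasta(path):
    """Return a list of (header, sequence) tuples from a FASTA file."""
    records = []
    header = None
    chunks = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                # A new header closes the previous record, if there was one
                if header is not None:
                    records.append((header, "".join(chunks)))
                header = line[1:]
                chunks = []
            elif header is not None:
                # Sequence data may be wrapped over several lines
                chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records
```

Calling `read_fasta("proteins.fasta")` (a placeholder file name) would give one tuple per record, with wrapped sequence lines joined into a single string.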
I noticed that you aren't calculating your JC distance yet, so perhaps this is where you need help. Here is what I came up with:
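import math
try:
    from itertools import izip  # Python 2
except ImportError:
    izip = zip  # Python 3: the built-in zip is already lazy

def computeJC(seq1, seq2):
    # p is the proportion of sites that differ between the two sequences
    differ = 0
    for base1, base2 in izip(seq1, seq2):
        differ += (base1 != base2)
    p = differ / float(len(seq1))
    # float literals avoid integer division under Python 2
    return -3.0 / 4 * math.log(1 - 4.0 / 3 * p)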
The itertools.izip trick is explained in the question "How can I iterate through two lists in parallel?" (in Python 3, the built-in zip behaves the same way). This piece of code will work with any kind of string, and the loop will stop when either seq1 or seq2 reaches the end.
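The stopping behaviour is easy to check with two strings of different lengths. The sketch below uses Python 3's built-in `zip`; the two sequences are made-up examples.

```python
seq1 = "ACGTAC"
seq2 = "ACGA"

# zip pairs the characters position by position and stops at the shorter string
pairs = list(zip(seq1, seq2))
print(pairs)

# Only four sites are compared, although seq1 has six characters
compared = len(pairs)
differing = sum(a != b for a, b in pairs)
print(compared, differing)
```

Because the extra characters are silently ignored, the proportion of differing sites is only meaningful when both sequences are aligned and have the same length.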
Someone else may come up with a "Pythonic one-liner", but try to understand my approach first. It avoids the pitfalls that your code fell into: nested loops, unnecessary branching, runtime list growing, and spaghetti code, to name a few. Enjoy!
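For the last part of the question, writing the distances to an output file as a table, one possible sketch is a tab-separated file with the sequence names as row and column labels. The function name `write_distance_table`, the tab separator, and the four-decimal formatting are all assumptions, not something the question specifies.

```python
def write_distance_table(names, dist_matrix, path):
    """Write a square distance matrix as a tab-separated table."""
    with open(path, "w") as out:
        # Header row: an empty corner cell followed by the sequence names
        out.write("\t".join([""] + list(names)) + "\n")
        for name, row in zip(names, dist_matrix):
            cells = ["{:.4f}".format(d) for d in row]
            out.write("\t".join([name] + cells) + "\n")

# Made-up example: two sequences at distance 0.1 from each other
write_distance_table(["seqA", "seqB"], [[0.0, 0.1], [0.1, 0.0]], "distances.tsv")
```

A file in this layout opens directly in a spreadsheet program, which is usually enough for a small matrix.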