问题描述
我正在使用Python和正则表达式来查找ORF
(开放阅读框).
I am using Python and a regular expression to find an ORF
(open reading frame).
在子字符串中找到一个字符串,该字符串仅由以下字母组成:ATGC
(无空格或换行):
Find a sub-string a string that is composed ONLY of the letters ATGC
(no spaces or new lines) that:
以ATG
开头,以TAG
或TAA
或TGA
结尾,并应考虑从第一个字符开始的顺序,然后是第二个字符,然后是第三个字符:
Starts with ATG
, ends with TAG
or TAA
or TGA
and should consider the sequence from the first character, then second and then third:
Seq= "CCTCAGCGAGGACAGCAAGGGACTAGCCAGGAGGGAGAACAGAAACTCCAGAACATCTTGGAAATAGCTCCCAGAAAAGC
AAGCAGCCAACCAGGCAGGTTCTGTCCCTTTCACTCACTGGCCCAAGGCGCCACATCTCCCTCCAGAAAAGACACCATGA
GCACAGAAAGCATGATCCGCGACGTGGAACTGGCAGAAGAGGCACTCCCCCAAAAGATGGGGGGCTTCCAGAACTCCAGG
CGGTGCCTATGTCTCAGCCTCTTCTCATTCCTGCTTGTGGCAGGGGCCACCACGCTCTTCTGTCTACTGAACTTCGGGGT
GATCGGTCCCCAAAGGGATGAGAAGTTCCCAAATGGCCTCCCTCTCATCAGTTCTATGGCCCAGACCCTCACACTCAGAT
CATCTTCTCAAAATTCGAGTGACAAGCCTGTAGCCCACGTCGTAGCAAACCACCAAGTGGAGGAGCAGCTGGAGTGGCTG
AGCCAGCGCGCCAACGCCCTCCTGGCCAACGGCATGGATCTCAAAGACAACCAACTAGTGGTGCCAGCCGATGGGTTGTA
CCTTGTCTACTCCCAGGTTCTCTTCAAGGGACAAGGCTGCCCCGACTACGTGCTCCTCACCCACACCGTCAGCCGATTTG
CTATCTCATACCAGGAGAAAGTCAACCTCCTCTCTGCCGTCAAGAGCCCCTGCCCCAAGGACACCCCTGAGGGGGCTGAG
CTCAAACCCTGGTATGAGCCCATATACCTGGGAGGAGTCTTCCAGCTGGAGAAGGGGGACCAACTCAGCGCTGAGGTCAA
TCTGCCCAAGTACTTAGACTTTGCGGAGTCCGGGCAGGTCTACTTTGGAGTCATTGCTCTGTGAAGGGAATGGGTGTTCA
TCCATTCTCTACCCAGCCCCCACTCTGACCCCTTTACTCTGACCCCTTTATTGTCTACTCCTCAGAGCCCCCAGTCTGTA
TCCTTCTAACTTAGAAAGGGGATTATGGCTCAGGGTCCAACTCTGTGCTCAGAGCTTTCAACAACTACTCAGAAACACAA
GATGCTGGGACAGTGACCTGGACTGTGGGCCTCTCATGCACCACCATCAAGGACTCAAATGGGCTTTCCGAATTCACTGG
AGCCTCGAATGTCCATTCCTGAGTTCTGCAAAGGGAGAGTGGTCAGGTTGCCTCTGTCTCAGAATGAGGCTGGATAAGAT
CTCAGGCCTTCCTACCTTCAGACCTTTCCAGATTCTTCCCTGAGGTGCAATGCACAGCCTTCCTCACAGAGCCAGCCCCC
CTCTATTTATATTTGCACTTATTATTTATTATTTATTTATTATTTATTTATTTGCTTATGAATGTATTTATTTGGAAGGC
CGGGGTGTCCTGGAGGACCCAGTGTGGGAAGCTGTCTTCAGACAGACATGTTTTCTGTGAAAACGGAGCTGAGCTGTCCC
CACCTGGCCTCTCTACCTTGTTGCCTCCTCTTTTGCTTATGTTTAAAACAAAATATTTATCTAACCCAATTGTCTTAATA
ACGCTGATTTGGTGACCAGGCTGTCGCTACATCACTGAACCTCTGCTCCCCACGGGAGCCGTGACTGTAATCGCCCTACG
GGTCATTGAGAGAAATAA"
我尝试过的事情:
# finding the stop codon here
def stop_codon(seq_0):
for i in range(0,len(seq_0),3):
if (seq_0[i:i+3]== "TAA" and i%3==0) or (seq_0[i:i+3]== "TAG" and i%3==0) or (seq_0[i:i+3]== "TGA" and i%3==0) :
a =i+3
break
else:
a = None
# finding the start codon here
startcodon_find =[m.start() for m in re.finditer('ATG', seq_0)]
我如何找到一种检查起始密码子然后找到第一个终止密码子的方法.随后找到下一个起始密码子和下一个终止密码子.
How can I find a way to check the start codon and then find the first stop codon. Subsequently find the next start codon and the next stop codon.
我希望将其运行三帧.如前所述,这三个帧会将序列的第一个,第二个和第三个字符作为开始.
I wish to run this for three frames. As mentioned earlier the three frames would be considering the first, second and third characters of the sequence as the start.
还需要将序列分为3的小部分.因为它应该是这样的:
Also the sequence needs to be divided into small parts of 3. There for it should be some thing like this:
ATG TTT AAA ACA AAA TAT TTA TCT AAC CCA ATT GTC TTA ATA ACG CTG ATT TGA
任何帮助将不胜感激.
我的最终答案:
def orf_find(st0):
seq_0=""
for i in range(0,len(st0),3):
if len(st0[i:i+3])==3:
seq_0 = seq_0 + st0[i:i+3]+ " "
ms_1 =[m.start() for m in re.finditer('ATG', seq_0)]
ms_2 =[m.start() for m in re.finditer('(TAA)|(TAG)|(TGA)', seq_0)]
def get_next(arr,value):
for a in arr:
if a > value:
return a
return -1
codons = []
start_codon=ms_1[0]
while (True):
stop_codon = get_next(ms_2,start_codon)
if stop_codon == -1:
break
codons.append((start_codon,stop_codon))
start_codon = get_next(ms_1,stop_codon)
if start_codon==-1:
break
max_val = 0
selected_tupple = ()
for i in codons:
k=i[1]-i[0]
if k > max_val:
max_val = k
selected_tupple = i
print "selected tupple is ", selected_tupple
final_seq=seq_0[selected_tupple[0]:selected_tupple[1]+3]
print final_seq
print "The longest orf length is " + str(max_val)
output_file = open('Longorf.txt','w')
output_file.write(str(orf_find(st0)))
output_file.close()
上述写入功能对将内容写入文本文件没有帮助.我进入的所有内容都没有..为什么会出现此错误..任何人都可以帮忙吗?
The above write function does not help me in writing the content on to a text file . All i get in there is NONE.. Why this error .. Can anybody Help ?
推荐答案
您已将其标记为Biopython,我想您知道Biopython.你检查过文件了吗? http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc231 可能有帮助.
As you have tagged it Biopython I suppose you know of Biopython. Have you checked out the docu yet? http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc231 might help.
我对上述链接中的代码进行了一些调整,以根据您的顺序进行操作:
I adjusted the code from the above link a bit to work on your sequence:
from Bio.Seq import Seq
seq = Seq("CCTCAGCGAGGACAGCAAGGGACTAGCCAGGAGGGAGAACAGAAACTCCAGAACATCTTGGAAATAGCTCCCAGAAAAGCAAGCAGCCAACCAGGCAGGTTCTGTCCCTTTCACTCACTGGCCCAAGGCGCCACATCTCCCTCCAGAAAAGACACCATGAGCACAGAAAGCATGATCCGCGACGTGGAACTGGCAGAAGAGGCACTCCCCCAAAAGATGGGGGGCTTCCAGAACTCCAGGCGGTGCCTATGTCTCAGCCTCTTCTCATTCCTGCTTGTGGCAGGGGCCACCACGCTCTTCTGTCTACTGAACTTCGGGGTGATCGGTCCCCAAAGGGATGAGAAGTTCCCAAATGGCCTCCCTCTCATCAGTTCTATGGCCCAGACCCTCACACTCAGATCATCTTCTCAAAATTCGAGTGACAAGCCTGTAGCCCACGTCGTAGCAAACCACCAAGTGGAGGAGCAGCTGGAGTGGCTGAGCCAGCGCGCCAACGCCCTCCTGGCCAACGGCATGGATCTCAAAGACAACCAACTAGTGGTGCCAGCCGATGGGTTGTACCTTGTCTACTCCCAGGTTCTCTTCAAGGGACAAGGCTGCCCCGACTACGTGCTCCTCACCCACACCGTCAGCCGATTTGCTATCTCATACCAGGAGAAAGTCAACCTCCTCTCTGCCGTCAAGAGCCCCTGCCCCAAGGACACCCCTGAGGGGGCTGAGCTCAAACCCTGGTATGAGCCCATATACCTGGGAGGAGTCTTCCAGCTGGAGAAGGGGGACCAACTCAGCGCTGAGGTCAATCTGCCCAAGTACTTAGACTTTGCGGAGTCCGGGCAGGTCTACTTTGGAGTCATTGCTCTGTGAAGGGAATGGGTGTTCATCCATTCTCTACCCAGCCCCCACTCTGACCCCTTTACTCTGACCCCTTTATTGTCTACTCCTCAGAGCCCCCAGTCTGTATCCTTCTAACTTAGAAAGGGGATTATGGCTCAGGGTCCAACTCTGTGCTCAGAGCTTTCAACAACTACTCAGAAACACAAGATGCTGGGACAGTGACCTGGACTGTGGGCCTCTCATGCACCACCATCAAGGACTCAAATGGGCTTTCCGAATTCACTGGAGCCTCGAATGTCCATTCCTGAGTTCTGCAAAGGGAGAGTGGTCAGGTTGCCTCTGTCTCAGAATGAGGCTGGATAAGATCTCAGGCCTTCCTACCTTCAGACCTTTCCAGATTCTTCCCTGAGGTGCAATGCACAGCCTTCCTCACAGAGCCAGCCCCCCTCTATTTATATTTGCACTTATTATTTATTATTTATTTATTATTTATTTATTTGCTTATGAATGTATTTATTTGGAAGGCCGGGGTGTCCTGGAGGACCCAGTGTGGGAAGCTGTCTTCAGACAGACATGTTTTCTGTGAAAACGGAGCTGAGCTGTCCCCACCTGGCCTCTCTACCTTGTTGCCTCCTCTTTTGCTTATGTTTAAAACAAAATATTTATCTAACCCAATTGTCTTAATAACGCTGATTTGGTGACCAGGCTGTCGCTACATCACTGAACCTCTGCTCCCCACGGGAGCCGTGACTGTAATCGCCCTACGGGTCATTGAGAGAAATAA")
table = 1
min_pro_len = 100
for strand, nuc in [(+1, seq), (-1, seq.reverse_complement())]:
for frame in range(3):
for pro in nuc[frame:].translate(table).split("*"):
if len(pro) >= min_pro_len:
print "%s...%s - length %i, strand %i, frame %i" % (pro[:30], pro[-3:], len(pro), strand, frame)
ORF也被翻译.您可以选择其他翻译表.查看 http://biopython.org/DIST/docs/tutorial/Tutorial .html#sec:translation
The ORF is also translated. You can choose a different translation table. Check out http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec:translation
代码说明:
在顶部,我从您的字符串中创建了一个序列对象.注意seq = Seq("ACGT")
.两个for循环创建6个不同的帧.内部for循环根据选择的翻译表翻译每个框架,并返回一个氨基酸链,其中每个终止密码子均编码为*
. split
函数拆分此字符串,删除这些占位符,从而生成可能的蛋白质序列列表.通过设置min_pro_len,您可以定义要检测的蛋白质的最小氨基酸链长度. 1是标准表.查看 http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi#SG1 在这里,您看到起始密码子是AUG
(等于ATG
),而末端密码子(核苷酸序列中的*)是TAA
,TAG
和TGA
,就像您想要的一样.您还可以使用其他翻译表.
Right at the top I create a sequence object out of your string. Notice the seq = Seq("ACGT")
.The two for-loops create the 6 different frames. The inner for-loop translates each frame according to the chosen translation table and returns an amino acid chain where each stop codon is coded as *
. The split
function splits this string removing these placeholders resulting in a list of possible protein sequences. By setting min_pro_len you can define the minimum amino acid chain length for a protein to be detected. 1 is the standard table. Check out http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi#SG1 Here you see that the initiation codon is AUG
(equals ATG
) and the end codons (* in the nucleotide sequence) are TAA
, TAG
, and TGA
, just like you wanted. You could also use a different translation table.
添加时
print nuc[frame:].translate(table)
在第二个for循环内,您将得到类似的内容:
right inside the second for-loop you get something like:
PQRGQQGTSQEGEQKLQNILEIAPRKASSQPGRFCPFHSLAQGATSPSRKDTMSTESMIRDVELAEEALPQKMGGFQNSRRCLCLSLFSFLLVAGATTLFCLLNFGVIGPQRDEKFPNGLPLISSMAQTLTLRSSSQNSSDKPVAHVVANHQVEEQLEWLSQRANALLANGMDLKDNQLVVPADGLYLVYSQVLFKGQGCPDYVLLTHTVSRFAISYQEKVNLLSAVKSPCPKDTPEGAELKPWYEPIYLGGVFQLEKGDQLSAEVNLPKYLDFAESGQVYFGVIAL*REWVFIHSLPSPHSDPFTLTPLLSTPQSPQSVSF*LRKGIMAQGPTLCSELSTTTQKHKMLGQ*PGLWASHAPPSRTQMGFPNSLEPRMSIPEFCKGRVVRLPLSQNEAG*DLRPSYLQTFPDSSLRCNAQPSSQSQPPSIYICTYYLLFIYYLFICL*MYLFGRPGCPGGPSVGSCLQTDMFSVKTELSCPHLASLPCCLLFCLCLKQNIYLTQLS**R*FGDQAVATSLNLCSPREP*L*SPYGSLREI
(请注意,星号位于终止密码子位置)
(notice the asterisks are at the stop codon positions)
回答第二个问题:
您必须返回要写入文件的字符串.创建一个输出字符串,并在函数末尾将其返回:
You must return a string that you want write into a file. Create an Output string and return it at the end of the function:
output = "selected tupple is " + str(selected_tupple) + "\n"
output += final_seq + "\n"
output += "The longest orf length is " + str(max_val) + "\n"
return output
这篇关于如何在Python中找到开放阅读框的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!