我有一个fastq文件,例如reads.fastq。我有一个7-mer字符串列表。对于每个在reads.fastq中读取的内容,我想检查它是否包含列表中至少一个7-mer字符串。条件是,如果找到匹配项(hamming distance ==0),则将读取内容写入数组chosen_reads,并且匹配来自fastq文件的下一个读取内容。如果未找到匹配项,则循环继续进行,直到找到匹配项为止。输出数组由唯一读取组成,因为一旦找到第一个匹配项,匹配循环就会终止。我编写了以下代码,但由于报告了汉明距离为零的所有匹配项,因此输出数组中的读取不是唯一的。请提出修改建议:

def hamming(s1, s2):
    #Return the Hamming distance between equal-length sequences
    if len(s1) != len(s2):
        raise ValueError("Undefined for sequences of unequal length")

    return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))

for x in Bio.SeqIO.parse("reads.fastq","fastq"):
        reads_array.append(x)

nmer = 7
l_chosen = ['gttattt','attattt','tgctagt']

chosen_reads = []
for x in reads_array:
    s2 = str(x.seq)
    for s in [s2[i:i+nmer] for i in range(len(s2)-nmer-1)]:
        for ds in l_chosen:
            dist = hamming(ds,s)
            if dist == 0:
                print s2, s,ds,dist
                chosen_reads.append(x)

最佳答案

当当前代码发现汉明距离为0的字符串时,您的当前代码不会从循环中跳出以从下一个read中读取,您应该使用标志来决定何时断开,并为该标志分配True值当您需要突围时-

def hamming(s1, s2):
    #Return the Hamming distance between equal-length sequences
    if len(s1) != len(s2):
        raise ValueError("Undefined for sequences of unequal length")
    return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))

for x in Bio.SeqIO.parse("reads.fastq","fastq"):
        reads_array.append(x)

nmer = 7

l_chosen = ['gttattt','attattt','tgctagt']
chosen_reads = []

for x in reads_array:
        s2 = str(x.seq)
        breakFlag = False
        for s in [s2[i:i+nmer] for i in range(len(s2)-nmer-1)]:
                for ds in l_chosen:
                        dist = hamming(ds,s)
                        if dist == 0:
                                print s2, s,ds,dist
                                chosen_reads.append(x)
                                breakFlag = True
                                break;
                if breakFlag:
                        break;


并且您确定要在reads.fastq后面附加x似乎是错误的,为了获得唯一的匹配,也许您应该在chosen_reads字符串后面加上匹配的s2正确吗?如果那是您想要的,则可以将元组附加到ds,如下所示,而不是当前的附加逻辑-

chosen_reads.append((ds, s2))

关于python - 选择汉明距离为零的读数,我们在Stack Overflow上找到一个类似的问题:https://stackoverflow.com/questions/30929484/

10-12 17:00