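I'm trying to build a dictionary without duplicates from a CSV file. Each row of the CSV contains: two sample names (s1, s2, etc.), a gene name, the impact of the mutation in sample 1, and the impact of the mutation in sample 2. Here are two example rows from the CSV file: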

s1, s2, gene1, MODERATE, HIGH
s3, s4, gene2, HIGH, MODERATE


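My goal is to get a summary of how many samples have a mutation in a particular gene, and to mark whether each of those mutations has a HIGH impact.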

For example:

gene12  7   ['s1', 's3', 's4', 's10 [HIGH]', 's17', 's19', 's24 [HIGH]']
gene20  2   ['s10 [HIGH]', 's21']


Currently, my code is as follows:

import os
import sys

path = ("path/to/csv")
open_csv = open(path+"csvfile", "r")
read_csv = open_csv.read().splitlines()
gene_dict = {}
for line in read_csv:
    split_lines = line.split(", ")
    gene = split_lines[2]
    sample1 = split_lines[0]
    sample2 = split_lines[1]
    impact1 = split_lines[3]
    impact2 = split_lines[4]
    for i in range(0, len(read_csv)):
        if gene in gene_dict:
            if impact1 == "HIGH":
                gene_dict[gene].append(sample1+" [HIGH]")
            if impact2 == "HIGH":
                gene_dict[gene].append(sample2+" [HIGH]")
            else:
                gene_dict[gene].append(sample1)
                gene_dict[gene].append(sample2)
        else:
            gene_dict[gene] = [sample1]

final_dict = {a:list(set(b)) for a, b in gene_dict.items()}

for key, value in final_dict.items():
    genename = key
    num_samples = len([item for item in value if item])
    samples = value
    print(genename,num_samples,samples)


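My script works fine, except that I get duplicate samples. By that I mean that if a sample has a HIGH impact mutation in a gene, the final summary lists that sample twice. Here is an example of what I mean: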

gene12  8   ['s1', 's3', 's4', 's10 [HIGH]', 's17', 's19', 's24', 's24 [HIGH]']
gene20  3   ['s10', 's10 [HIGH]', 's21']


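It may be the way I build the dictionary that causes the duplicates, but I can't figure it out. You can see that for gene12, s24 is listed twice, which throws off the count. The same goes for gene20 with s10. The sample is listed twice, once correctly with the HIGH impact mutation and once without it. However, s24 only has a HIGH impact mutation in gene12, and s10 only has a HIGH impact mutation in gene20. I hope this makes sense; I can clarify if needed. Thanks in advance for any help!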

Best answer

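It looks like your inner loop (for i in range(0, len(read_csv)):) just repeats the same work for every row and appends useless duplicate matches. The if / if / else structure that adds the [HIGH] tag also looks wrong: the else belongs only to the second if, so when impact1 is HIGH and impact2 is not, sample1 is appended once with the tag and once without it.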
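To see the second problem in isolation, here is a minimal sketch that runs the original if / if / else branch on a single row. The sample names and impacts (s24, s25, HIGH, MODERATE) are made-up values for illustration, not taken from your file.

```python
# Hypothetical row: sample1 has a HIGH impact, sample2 does not.
sample1, sample2 = "s24", "s25"
impact1, impact2 = "HIGH", "MODERATE"

samples = []
if impact1 == "HIGH":
    samples.append(sample1 + " [HIGH]")
if impact2 == "HIGH":
    samples.append(sample2 + " [HIGH]")
else:
    # This else pairs only with the second if, so it also runs
    # when impact1 is HIGH and re-adds sample1 without the tag.
    samples.append(sample1)
    samples.append(sample2)

# set() cannot merge "s24" and "s24 [HIGH]": they are different strings.
unique = sorted(set(samples))
print(unique)
```

This prints ['s24', 's24 [HIGH]', 's25'], which is the same kind of double listing shown in the question's output.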

Corrected version:

path = ("path/to/csv")
gene_dict = {}

# Read all rows; the file is closed automatically when the block ends.
with open(path+"csvfile", "r") as open_csv:
    read_csv = open_csv.read().splitlines()

for line in read_csv:
    split_lines = line.split(", ")
    gene = split_lines[2]
    sample1 = split_lines[0]
    sample2 = split_lines[1]
    impact1 = split_lines[3]
    impact2 = split_lines[4]
    # Tag each sample independently, before it is stored.
    if impact1 == "HIGH":
        sample1 = sample1 + " [HIGH]"
    if impact2 == "HIGH":
        sample2 = sample2 + " [HIGH]"

    # Store both samples exactly once per row.
    if gene in gene_dict:
        gene_dict[gene].append(sample1)
        gene_dict[gene].append(sample2)
    else:
        gene_dict[gene] = [sample1, sample2]

# Drop exact duplicate entries for each gene.
final_dict = {a:list(set(b)) for a, b in gene_dict.items()}

for key, value in final_dict.items():
    genename = key
    num_samples = len([item for item in value if item])
    samples = value
    print(genename,num_samples,samples)


This looks consistent for the few examples I tried.
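One case the corrected version does not cover: if the same sample shows up for the same gene in two rows, once as HIGH and once not, the tagged and untagged strings would still both survive set(). The sketch below is one possible way to guard against that and is not part of the original answer. It keys each gene's samples by name and stores a HIGH flag, so a sample can appear only once per gene. The function name summarize, the inline data, and the rule that HIGH wins over any other impact are assumptions made for illustration.

```python
import csv
import io

def summarize(rows):
    """Map gene -> {sample: True if any HIGH impact was seen}."""
    genes = {}
    for sample1, sample2, gene, impact1, impact2 in rows:
        samples = genes.setdefault(gene, {})
        for sample, impact in ((sample1, impact1), (sample2, impact2)):
            # Assumption: once a sample is HIGH for a gene, it stays HIGH.
            samples[sample] = samples.get(sample, False) or impact == "HIGH"
    return genes

# Hypothetical data in the same layout as the question's CSV.
data = io.StringIO(
    "s1, s2, gene1, MODERATE, HIGH\n"
    "s2, s3, gene1, MODERATE, MODERATE\n"
    "s3, s4, gene2, HIGH, MODERATE\n"
)
# skipinitialspace=True drops the space that follows each comma.
genes = summarize(csv.reader(data, skipinitialspace=True))

for gene, samples in genes.items():
    labels = [s + " [HIGH]" if high else s for s, high in samples.items()]
    print(gene, len(labels), labels)
```

With this data, s2 is listed once for gene1 as 's2 [HIGH]' even though its second row is MODERATE, and the count is simply the number of keys for that gene.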

Regarding python - creating a dictionary without duplicates, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/54314621/
