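I'm trying to create a dictionary without duplicates from a CSV file. The CSV contains: sample names (s1, s2, etc.), a gene name, the impact of the mutation in sample 1, and the impact of the mutation in sample 2. Here are two example rows from the CSV file: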
s1, s2, gene1, MODERATE, HIGH
s3, s4, gene2, HIGH, MODERATE
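My goal is to get a summary of how many samples have a mutation in a particular gene, and then to show whether each of those mutations has a HIGH impact.
For example: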
gene12 7 ['s1', 's3', 's4', 's10 [HIGH]', 's17', 's19', 's24 [HIGH]']
gene20 2 ['s10 [HIGH]', 's21']
At the moment, my code is as follows:
import os
import sys

path = ("path/to/csv")
open_csv = open(path+"csvfile", "r")
read_csv = open_csv.read().splitlines()

gene_dict = {}
for line in read_csv:
    split_lines = line.split(", ")
    gene = split_lines[2]
    sample1 = split_lines[0]
    sample2 = split_lines[1]
    impact1 = split_lines[3]
    impact2 = split_lines[4]
    for i in range(0, len(read_csv)):
        if gene in gene_dict:
            if impact1 == "HIGH":
                gene_dict[gene].append(sample1+" [HIGH]")
            if impact2 == "HIGH":
                gene_dict[gene].append(sample2+" [HIGH]")
            else:
                gene_dict[gene].append(sample1)
                gene_dict[gene].append(sample2)
        else:
            gene_dict[gene] = [sample1]

final_dict = {a:list(set(b)) for a, b in gene_dict.items()}
for key, value in final_dict.items():
    genename = key
    num_samples = len([item for item in value if item])
    samples = value
    print(genename,num_samples,samples)
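My script works fine, except that I get duplicate samples. What I mean is that if a sample has a high-impact mutation in a gene, the final summary lists that sample twice. Here is an example of what I mean: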
gene12 8 ['s1', 's3', 's4', 's10 [HIGH]', 's17', 's19', 's24', 's24 [HIGH]']
gene20 3 ['s10', 's10 [HIGH]', 's21']
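It's probably the way I'm building the dictionary that causes the duplicates, but I can't figure it out. You'll see that for gene12, s24 is listed twice, which throws off the count. The same goes for gene20 with s10. The sample is listed twice: once correctly, with the high-impact mutation, and once without it. However, s24 has a HIGH impact mutation only in gene12, and s10 has a HIGH impact mutation only in gene20. I hope this makes sense. I can clarify if needed. Thanks in advance for all your help!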
Best answer
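It looks like your inner loop, for i in range(0, len(read_csv)):, is repeating the appends and adding useless entries. The if / if / else structure and the way the [HIGH] tag is added also look wrong: the else pairs only with the second if, so whenever impact1 is HIGH and impact2 is not, sample1 is appended both with and without the tag.
Corrected version: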
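To see the tagging problem in isolation, here is a minimal sketch that applies the question's if / if / else chain to a single row and compares it with tagging each sample independently. The helper names tag_samples_buggy and tag_samples_fixed are hypothetical, introduced only for this illustration.

```python
def tag_samples_buggy(sample1, impact1, sample2, impact2):
    """Mimic the question's if / if / else chain for one row (hypothetical helper)."""
    out = []
    if impact1 == "HIGH":
        out.append(sample1 + " [HIGH]")
    if impact2 == "HIGH":
        out.append(sample2 + " [HIGH]")
    else:
        # This else pairs only with the second if, so it also runs
        # when impact1 is HIGH and impact2 is not.
        out.append(sample1)
        out.append(sample2)
    return out


def tag_samples_fixed(sample1, impact1, sample2, impact2):
    """Tag each sample independently of the other (hypothetical helper)."""
    if impact1 == "HIGH":
        sample1 = sample1 + " [HIGH]"
    if impact2 == "HIGH":
        sample2 = sample2 + " [HIGH]"
    return [sample1, sample2]


# Second example row from the question: s3, s4, gene2, HIGH, MODERATE
print(tag_samples_buggy("s3", "HIGH", "s4", "MODERATE"))  # ['s3 [HIGH]', 's3', 's4']
print(tag_samples_fixed("s3", "HIGH", "s4", "MODERATE"))  # ['s3 [HIGH]', 's4']
```

Because 's3' and 's3 [HIGH]' are different strings, a later set() call cannot merge them, which is how the same sample ends up counted twice.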
import os
import sys

path = ("path/to/csv")
open_csv = open(path+"csvfile", "r")
read_csv = open_csv.read().splitlines()

gene_dict = {}
for line in read_csv:
    split_lines = line.split(", ")
    gene = split_lines[2]
    sample1 = split_lines[0]
    sample2 = split_lines[1]
    impact1 = split_lines[3]
    impact2 = split_lines[4]
    # Tag each sample once, before it is stored
    if impact1 == "HIGH":
        sample1 = sample1 + " [HIGH]"
    if impact2 == "HIGH":
        sample2 = sample2 + " [HIGH]"
    # Store both samples under the gene, creating the list on first sight
    if gene in gene_dict:
        gene_dict[gene].append(sample1)
        gene_dict[gene].append(sample2)
    else:
        gene_dict[gene] = [sample1, sample2]

# set() drops exact duplicate entries for each gene
final_dict = {a:list(set(b)) for a, b in gene_dict.items()}
for key, value in final_dict.items():
    genename = key
    num_samples = len([item for item in value if item])
    samples = value
    print(genename,num_samples,samples)
This gives consistent results for the few examples I tried.
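One case the list-plus-set approach does not cover is a sample that appears in several rows for the same gene, tagged HIGH in one row and not in another: the tagged and untagged strings differ, so both survive set(). The following is only a sketch of one possible variation, assuming that a sample should count as HIGH for a gene if any row marks it so. It uses the standard csv module and a per-gene dictionary; the summarize function and the demo data are hypothetical and not part of the original answer.

```python
import csv
import io
from collections import defaultdict


def summarize(lines):
    """Return {gene: sorted sample names, tagged ' [HIGH]' where applicable}."""
    # gene -> {sample: True if any row marks it HIGH for that gene}
    high_by_gene = defaultdict(dict)
    # skipinitialspace=True drops the space that follows each comma
    reader = csv.reader(lines, skipinitialspace=True)
    for row in reader:
        if len(row) < 5:
            continue  # skip blank or malformed rows
        sample1, sample2, gene, impact1, impact2 = row[:5]
        for sample, impact in ((sample1, impact1), (sample2, impact2)):
            seen_high = high_by_gene[gene].get(sample, False)
            high_by_gene[gene][sample] = seen_high or impact == "HIGH"
    return {
        gene: sorted(
            (s + " [HIGH]") if is_high else s for s, is_high in samples.items()
        )
        for gene, samples in high_by_gene.items()
    }


# Hypothetical data in the same layout as the question's CSV;
# s3 appears twice for gene2, once as HIGH and once as MODERATE.
demo = io.StringIO(
    "s1, s2, gene1, MODERATE, HIGH\n"
    "s3, s4, gene2, HIGH, MODERATE\n"
    "s3, s5, gene2, MODERATE, MODERATE\n"
)
for gene, samples in summarize(demo).items():
    print(gene, len(samples), samples)
```

Keying on the bare sample name means each sample is counted once per gene, and the [HIGH] tag is added only when the output is built. Note that sorted() orders the names as strings, so s10 would sort before s2.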
Regarding "python - creating a dictionary without duplicates", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/54314621/