python - Optimizing search in a very large CSV file

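I have a single-column csv file with 6.2 million rows, each containing a string of 6 to 20 letters. Some strings occur in duplicate (or more) entries, and I want to write those to a new csv file; my guess is that there should be about one million non-unique strings. That's all there is to it. However, repeatedly searching through a dictionary of 6 million entries takes a very long time, and I'd appreciate any tips on how to do it. Based on some timings I've done, every script I've written so far would take at least a week (!) to run.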

First attempt:

import csv

in_file_1 = open('UniProt Trypsinome (full).csv','r')
in_list_1 = list(csv.reader(in_file_1))
out_file_1 = open('UniProt Non-Unique Reference Trypsinome.csv','w+')
out_file_2 = open('UniProt Unique Trypsin Peptides.csv','w+')
writer_1 = csv.writer(out_file_1)
writer_2 = csv.writer(out_file_2)

# Create trypsinome dictionary construct
ref_dict = {}
for row in range(len(in_list_1)):
    ref_dict[row] = in_list_1[row]

# Find unique/non-unique peptides from trypsinome
Peptide_list = []
Uniques = []
for n in range(len(in_list_1)):
    Peptide = ref_dict.pop(n)
    if Peptide in ref_dict.values(): # Non-unique peptides
        Peptide_list.append(Peptide)
    else:
        Uniques.append(Peptide) # Unique peptides

for m in range(len(Peptide_list)):
    Write_list = (str(Peptide_list[m]).replace("'","").replace("[",'').replace("]",''),'')
    writer_1.writerow(Write_list)

Second attempt:

import csv

in_file_1 = open('UniProt Trypsinome (full).csv','r')
in_list_1 = list(csv.reader(in_file_1))
out_file_1 = open('UniProt Non-Unique Reference Trypsinome.csv','w+')
writer_1 = csv.writer(out_file_1)

ref_dict = {}
for row in range(len(in_list_1)):
    Peptide = in_list_1[row]
    if Peptide in ref_dict.values():
        write = (in_list_1[row],'')
        writer_1.writerow(write)
    else:
        ref_dict[row] = in_list_1[row]

Edit: here are a few lines from the csv file:

SELVQK
AKLAEQAER
AKLAEQAERR
LAEQAER
LAEQAERYDDMAAAMK
LAEQAERYDDMAAAMKK
MTMDKSELVQK
YDDMAAAMKAVTEQGHELSNEER
YDDMAAAMKAVTEQGHELSNEERR

Best answer

First hint: Python supports lazy evaluation, and it is best to use it when dealing with huge datasets. So:

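iterate over your csv.reader instead of building a huge in-memory list,
don't build huge in-memory lists of indexes with range: use enumerate(seq) if you need both the item and its index, and just iterate over the sequence's items if you don't need the index.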
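
The two points above can be sketched as follows. This is only an illustration, not code from the answer: the helper name iter_peptides and the in-memory sample are assumptions, and the real file would be opened with open(path, newline='') instead. (In Python 3, range is itself lazy; in Python 2 it built a full list, which is what the hint warns against.)

```python
import csv
import io


def iter_peptides(handle):
    """Yield the first column of each non-empty row, one row at a time.

    Hypothetical helper: nothing is accumulated in memory, so it works
    the same way on three rows or on six million.
    """
    for row in csv.reader(handle):
        if row:  # csv.reader returns [] for blank lines
            yield row[0]


# Small in-memory stand-in for the real csv file (assumption for the demo).
sample = io.StringIO("SELVQK\nAKLAEQAER\nLAEQAER\n")

# enumerate() supplies the index lazily; no range(len(...)) is needed.
for index, peptide in enumerate(iter_peptides(sample)):
    print(index, peptide)
```

If the index is not needed, iterating directly with "for peptide in iter_peptides(handle)" is enough.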

Second hint: the main point of using a dict (a hash table) is to look up keys, not values... so don't build a huge dict that is only used as a list.
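
To make the difference concrete, here is a small sketch (the variable names are illustrative, not from the question). Testing membership in dict.values() scans the entries one by one, so doing it once per row costs on the order of n² comparisons overall; testing membership among the keys is a hash lookup that takes roughly constant time on average. Note also that rows returned by csv.reader are lists, which are not hashable, so the string itself (row[0]) has to be used as the key.

```python
peptides = ["SELVQK", "AKLAEQAER", "LAEQAER"]

# Dict used as a list: integer keys, peptides as values.
by_index = {i: p for i, p in enumerate(peptides)}
# Membership on values is a linear scan over every entry.
found_slow = "LAEQAER" in by_index.values()

# Peptides as keys: membership is a hash lookup.
by_peptide = {p: i for i, p in enumerate(peptides)}
found_fast = "LAEQAER" in by_peptide

print(found_slow, found_fast)
```

Both tests give the same answer; only the cost per lookup differs, and that cost is what dominates at 6.2 million rows.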

Third hint: if you just want a way to store "already seen" values, use a set.
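
Putting the hints together, a minimal sketch of a set-based pass might look like this. The function name split_duplicates and the sample data are assumptions for illustration, and the sketch assumes each non-unique string should be reported once rather than once per occurrence.

```python
import csv
import io


def split_duplicates(rows):
    """Return (non_unique, unique) sets of peptides from an iterable of csv rows.

    Hypothetical helper: a single pass, with one set lookup per row.
    """
    seen = set()        # every peptide encountered so far
    duplicated = set()  # peptides encountered more than once
    for row in rows:
        if not row:     # skip blank lines
            continue
        peptide = row[0]
        if peptide in seen:
            duplicated.add(peptide)
        else:
            seen.add(peptide)
    return duplicated, seen - duplicated


# Small in-memory stand-in for the real csv file (assumption for the demo).
sample = io.StringIO("SELVQK\nAKLAEQAER\nSELVQK\nLAEQAER\nSELVQK\n")
non_unique, unique = split_duplicates(csv.reader(sample))
print(sorted(non_unique), sorted(unique))
```

For the real data, the rows would come from csv.reader over the input file opened with newline='', and each result could be written out with writer.writerow([peptide]). Both sets have to fit in memory, which should be feasible for a few million short strings on a typical machine, though that is not something measured here.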

For more on python - Optimizing search in a very large CSV file, see the similar question on Stack Overflow: https://stackoverflow.com/questions/19224903/