Problem description
I have a large list of DOIs and I need the most efficient way to identify the ones that are repeated (i.e., print the index and the DOI for each repeated value). The list could contain 500,000+ DOIs. My current approach is this (inspired by this answer):
from collections import defaultdict

D = defaultdict(list)
# Map each DOI to the list of indexes at which it appears.
for i, item in enumerate(doiList):
    D[item].append(i)
# Keep only the DOIs that appear more than once.
D = {k: v for k, v in D.items() if len(v) > 1}
print(D)
Is there a more processing-efficient way of doing this?
Sample DOI list:
doiList = ['10.1016/j.ijnurstu.2017.05.011 [doi]', '10.1016/j.ijnurstu.2017.05.011 [doi]', '10.1167/iovs.16-20421 [doi]', '10.1093/cid/cix478 [doi]', '10.1038/bjc.2017.133 [doi]', '10.3892/or.2017.5646 [doi]', '10.1177/0961203317711009 [doi]', '10.2217/bmm-2017-0087 [doi]', '10.1007/s12016-017-8611-x [doi]', '10.1007/s10753-017-0594-5 [doi]', '10.1186/s13601-017-0150-2 [doi]', '10.3389/fimmu.2017.00515 [doi]', '10.2147/JAA.S131506 [doi]', '10.2147/JAA.S128431 [doi]', '10.1038/s41598-017-02293-z [doi]', '10.18632/oncotarget.17729 [doi]', '10.1073/pnas.1703683114 [doi]', '10.1096/fj.201600857RRR [doi]', '10.1128/AAC.00020-17 [doi]', '10.1016/j.jpain.2017.04.011 [doi]', '10.1016/j.jaip.2017.04.029 [doi]', '10.1016/j.anai.2017.04.021 [doi]', '10.1016/j.alit.2017.05.001 [doi]']
Recommended answer
Try storing them in a set instead. You can append the indexes of the duplicates to a single list, which might speed things up:
seen = set()
dupes = []
for i, doi in enumerate(doiList):
    if doi not in seen:
        # First occurrence: remember the DOI.
        seen.add(doi)
    else:
        # Repeat occurrence: record its index.
        dupes.append(i)
At this point, seen contains all the distinct DOI values, while dupes contains the second, third, etc. indexes of the duplicate values. You can look them up in doiList to determine which index corresponds to which value.
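For example, a minimal sketch of that lookup, reusing the dupes list and doiList from above:

# Print the index and the repeated DOI for each duplicate found.
for i in dupes:
    print(i, doiList[i])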
To get some more performance out of this, you can cache the bound methods as local names, avoiding a repeated attribute lookup on every iteration:
seen = set()
seen_add = seen.add          # cache the bound method lookups
dupes = []
dupes_append = dupes.append
for i, doi in enumerate(doiList):
    if doi not in seen:
        seen_add(doi)
    else:
        dupes_append(i)
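To check whether the cached-method version actually helps at your scale, a rough timing sketch with timeit could look like the following; testList is a hypothetical ~500,000-entry list built by repeating the sample data, not something from the original question:

import timeit

# Hypothetical test data: repeat the 23-entry sample to roughly 500,000 items.
testList = doiList * (500000 // len(doiList))

def find_dupes(items):
    seen = set()
    seen_add = seen.add
    dupes = []
    dupes_append = dupes.append
    for i, doi in enumerate(items):
        if doi not in seen:
            seen_add(doi)
        else:
            dupes_append(i)
    return dupes

# Average runtime over 10 runs, in seconds.
print(timeit.timeit(lambda: find_dupes(testList), number=10) / 10)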