问题描述
我有一个(可能相当大的)字典和一个'可能的'键列表。我想快速找到哪些键在字典中具有匹配的值。我发现很多关于单个字典值的讨论和这里,但没有讨论速度或多个条目。我已经提出了四种方式,对于最有效的三种方法,我比较不同样本大小的速度下面 - 有更好的方法吗?如果人们可以建议合理的竞争者,我将会对他们进行分析。
样例列表和词典创建如下:
导入cProfile
从随机导入randint
长度= 100000
listOfRandomInts = [randint 0,length * length / 10-1)for x in range(length)]
dictionaryOfRandomInts = {randint(0,length * length / 10-1):在这里for x in range(length)}
方法1:'中的关键字:
def way1(theList,theDict):
resultList = []
列表中的listItem:
如果listItem在theDict中:
resultsList.append(theDict [listItem])
返回resultList
cProfile.run('way1(listOfRandomInts,dictionaryOfRandomInts)')
32个函数调用在0.018秒
方法2:错误处理:
def way2(theList,theDict):
resultsList = []
列表中的listItem:
try:
resultsList.append(theDict [listItem])
除了:
;
return resultsList
cProfile.run('way2(listOfRandomInts,dictionaryOfRandomInts)')
在0.087秒内执行32个函数调用
方法3:设置交集:
def way3(theList,theDict):
返回列表(set(theList) .intersection(set(theDict.keys())))
cProfile.run('way3(listOfRandomInts,dictionaryOfRandomInts)')
26秒函数调用0.046秒
方法4:原生使用 dict.keys()
:
这是一个警告性的故事 - 这是我的第一次尝试,并且是最慢的!
def way4(theList ,theDict):
resultsList = []
keys = theDict.keys()
在列表中的listItem:
如果key中的listItem:
resultsList.append(theDict [listItem])
返回resultList
cProfile.run('way4(listOfRandomInts,dictionaryOfRandomInts)')
248.552秒的12个功能调用
编辑:答案与我用于一致性的框架相同。许多人注意到,在Python 3.x中可以实现更多的性能提升,特别是基于列表推导的方法。非常感谢所有的帮助!
方法5:更好的执行交叉的方式(感谢jonrsharpe):
def way5(theList,theDict):
return = list(set(theList).intersection(theDict))
0.037秒内的25个函数调用
方法6:列表理解(感谢jonrsharpe):
code> def way6(theList,theDict):
return [item中的item的项目如果item中的项目]
0.020秒内的24个函数调用
方法7:使用&
关键字(感谢jonrsharpe):
def way7(theList,theDict):
返回列表(theDict.viewkeys()& theList)
25秒函数调用0.026秒
对于方法1-3和5-7,我用lis长度为1000,10000,100000,1000000,10000000和100000000的t /字典,并显示所用时间的日志对数图。在所有长度上,交点和语句方法表现更好。梯度大约为1(可能稍高一点),表示O(n)或略微超线性缩放。
我尝试了几种其他方法,最快的是一个简单的列表理解:
def way6(theList,theDict):
return [item中的item的项目如果item中的项目]
这个运行方式与您最快的方法相同, way1
,但更快。为了比较,最快的设置
的方式是
def way5 theList,theDict):
返回列表(set(theList).intersection(theDict))
timeit
结果:
>>> import timeit
>>>从__main__导入方式1中的setup =导入方式1,way5,way6
从随机导入randint
length = 100000
listOfRandomInts = [randint(0,length * length / 10-1)范围(长度)]
dictionaryOfRandomInts = {randint(0,length * length / 10-1):范围(长度)中x的这里)
> >> timeit.timeit('way1(listOfRandomInts,dictionaryOfRandomInts)',setup = setup,number = 1000)
14.550477756582723
>>> timeit.timeit('way5(listOfRandomInts,dictionaryOfRandomInts)',setup = setup,number = 1000)
19.597916393388232
>>> timeit.timeit('way6(listOfRandomInts,dictionaryOfRandomInts)',setup = setup,number = 1000)
13.652289059326904
添加了@ abarnert的建议:
def way7(theList ,theDict):
返回列表(theDict.viewkeys()& theList)
和重新运行我现在得到的时间:
>>> timeit.timeit('way1(listOfRandomInts,dictionaryOfRandomInts)',setup = setup,number = 1000)
13.110055883138497
>>>> timeit.timeit('way5(listOfRandomInts,dictionaryOfRandomInts)',setup = setup,number = 1000)
17.292466681101036
>>> timeit.timeit('way6(listOfRandomInts,dictionaryOfRandomInts)',setup = setup,number = 1000)
14.351759544463917
>>>> timeit.timeit('way7(listOfRandomInts,dictionaryOfRandomInts)',setup = setup,number = 1000)
17.206370930653392
way1
和 way6
已切换位置,所以我重新运行:
>>> timeit.timeit('way1(listOfRandomInts,dictionaryOfRandomInts)',setup = setup,number = 1000)
13.648176054011941
>>> timeit.timeit('way6(listOfRandomInts,dictionaryOfRandomInts)',setup = setup,number = 1000)
13.847062579316628
所以看起来设置的方法比列表慢,但列表和列表的理解之间的差异(令人惊讶的是,至少对我来说)是一个有点变量。我会选择一个,而不用担心,除非它成为一个真正的瓶颈。
I have a (potentially quite big) dictionary and a list of 'possible' keys. I want to quickly find which of the keys have matching values in the dictionary. I've found lots of discussion of single dictionary values here and here, but no discussion of speed or multiple entries.
I've come up with four ways, and for the three that work best I compare their speed on different sample sizes below - are there better methods? If people can suggest sensible contenders I'll subject them to the analysis below as well.
Sample lists and dictionaries are created as follows:
import cProfile
from random import randint
length = 100000
listOfRandomInts = [randint(0,length*length/10-1) for x in range(length)]
dictionaryOfRandomInts = {randint(0,length*length/10-1): "It's here" for x in range(length)}
Method 1: the 'in'
keyword:
def way1(theList,theDict):
resultsList = []
for listItem in theList:
if listItem in theDict:
resultsList.append(theDict[listItem])
return resultsList
cProfile.run('way1(listOfRandomInts,dictionaryOfRandomInts)')
32 function calls in 0.018 seconds
Method 2: error handling:
def way2(theList,theDict):
resultsList = []
for listItem in theList:
try:
resultsList.append(theDict[listItem])
except:
;
return resultsList
cProfile.run('way2(listOfRandomInts,dictionaryOfRandomInts)')
32 function calls in 0.087 seconds
Method 3: set intersection:
def way3(theList,theDict):
return list(set(theList).intersection(set(theDict.keys())))
cProfile.run('way3(listOfRandomInts,dictionaryOfRandomInts)')
26 function calls in 0.046 seconds
Method 4: Naive use of dict.keys()
:
This is a cautionary tale - it was my first attempt and BY FAR the slowest!
def way4(theList,theDict):
resultsList = []
keys = theDict.keys()
for listItem in theList:
if listItem in keys:
resultsList.append(theDict[listItem])
return resultsList
cProfile.run('way4(listOfRandomInts,dictionaryOfRandomInts)')
12 function calls in 248.552 seconds
EDIT: Bringing the suggestions given in the answers into the same framework that I've used for consistency. Many have noted that more performance gains can be achieved in Python 3.x, particularly list comprehension-based methods. Many thanks for all of the help!
Method 5: Better way of performing intersection (thanks jonrsharpe):
def way5(theList, theDict):
return = list(set(theList).intersection(theDict))
25 function calls in 0.037 seconds
Method 6: List comprehension (thanks jonrsharpe):
def way6(theList, theDict):
return [item for item in theList if item in theDict]
24 function calls in 0.020 seconds
Method 7: Using the &
keyword (thanks jonrsharpe):
def way7(theList, theDict):
return list(theDict.viewkeys() & theList)
25 function calls in 0.026 seconds
For methods 1-3 and 5-7 I timed them as above with list/dictionaries of length 1000, 10000, 100000, 1000000, 10000000 and 100000000 and show a log-log plot of time taken. Across all lengths the intersection and in-statement method perform better. The gradients are all about 1 (maybe a bit higher), indicating O(n) or perhaps slightly super-linear scaling.
Of a couple of additional methods I've tried, the fastest was a simple list comprehension:
def way6(theList, theDict):
return [item for item in theList if item in theDict]
This runs the same process as your fastest approach, way1
, but more quickly. For comparison, the quickest set
-based way was
def way5(theList, theDict):
return list(set(theList).intersection(theDict))
timeit
results:
>>> import timeit
>>> setup = """from __main__ import way1, way5, way6
from random import randint
length = 100000
listOfRandomInts = [randint(0,length*length/10-1) for x in range(length)]
dictionaryOfRandomInts = {randint(0,length*length/10-1): "It's here" for x in range(length)}
"""
>>> timeit.timeit('way1(listOfRandomInts,dictionaryOfRandomInts)', setup=setup, number=1000)
14.550477756582723
>>> timeit.timeit('way5(listOfRandomInts,dictionaryOfRandomInts)', setup=setup, number=1000)
19.597916393388232
>>> timeit.timeit('way6(listOfRandomInts,dictionaryOfRandomInts)', setup=setup, number=1000)
13.652289059326904
Having added @abarnert's suggestion:
def way7(theList, theDict):
return list(theDict.viewkeys() & theList)
and re-run the timing I now get:
>>> timeit.timeit('way1(listOfRandomInts,dictionaryOfRandomInts)', setup=setup, number=1000)
13.110055883138497
>>> timeit.timeit('way5(listOfRandomInts,dictionaryOfRandomInts)', setup=setup, number=1000)
17.292466681101036
>>> timeit.timeit('way6(listOfRandomInts,dictionaryOfRandomInts)', setup=setup, number=1000)
14.351759544463917
>>> timeit.timeit('way7(listOfRandomInts,dictionaryOfRandomInts)', setup=setup, number=1000)
17.206370930653392
way1
and way6
have switched places, so I re-ran again:
>>> timeit.timeit('way1(listOfRandomInts,dictionaryOfRandomInts)', setup=setup, number=1000)
13.648176054011941
>>> timeit.timeit('way6(listOfRandomInts,dictionaryOfRandomInts)', setup=setup, number=1000)
13.847062579316628
So it looks like the set approach is slower than the list, but the difference between the list and list comprehension is (surprisingly, to me at least) is a bit variable. I'd say just pick one, and not worry about it unless it becomes a real bottleneck later.
这篇关于从给定的列表快速查找字典中的所有键的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!