Problem description
Please help me understand why this "replace from dictionary" operation is slow in Python/Pandas:
# Series has 200 rows and 1 column
# Dictionary has 11269 key-value pairs
series.replace(dictionary, inplace=True)
Dictionary lookups should be O(1). Replacing a value in a column should be O(1). Isn't this a vectorized operation? Even if it's not vectorized, iterating 200 rows is only 200 iterations, so how can it be slow?
Here is an SSCCE demonstrating the issue:
import pandas as pd
import random

# Initialize dummy data
dictionary = {}
orig = []
for x in range(11270):
    dictionary[x] = 'Some string ' + str(x)
for x in range(200):
    orig.append(random.randint(1, 11269))
series = pd.Series(orig)

# The actual operation we care about
print('Starting...')
series.replace(dictionary, inplace=True)
print('Done.')
Running that command takes more than 1 second on my machine, which is thousands of times longer than I would expect for fewer than 1,000 operations.
Recommended answer
It looks like replace has a bit of overhead, and explicitly telling the Series what to do via map yields the best performance:
series = series.map(lambda x: dictionary.get(x,x))
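As a quick sanity check that the map call above is a drop-in substitute for replace on this data, the sketch below rebuilds the example from the question and compares the two results element by element. The fixed random seed and the names replaced, mapped, and same are assumptions added for illustration only.

```python
import random

import pandas as pd

random.seed(0)  # assumption: fixed seed, only so the sketch is reproducible

# Same shape of data as in the question
dictionary = {x: 'Some string ' + str(x) for x in range(11270)}
series = pd.Series([random.randint(1, 11269) for _ in range(200)])

# The slow path from the question (non-inplace, so series stays intact)
replaced = series.replace(dictionary)

# The fast path: look each value up, falling back to the value itself
mapped = series.map(lambda x: dictionary.get(x, x))

# Compare as plain Python lists to sidestep any dtype differences
same = replaced.tolist() == mapped.tolist()
print(same)
```

Comparing via tolist() rather than Series.equals is a deliberate choice here, since the two methods are not guaranteed to produce the same dtype across pandas versions.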
If you're sure that all keys are in your dictionary, you can get a very slight performance boost by not creating a lambda and supplying the dictionary.get function directly. Any key that is not present comes back as a missing value (None or NaN) via this method, so beware:
series = series.map(dictionary.get)
You can also supply just the dictionary itself, but this appears to introduce a bit of overhead:
series = series.map(dictionary)
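The three map variants above differ in how they treat values that have no entry in the dictionary. The following is a minimal sketch of that difference, using a small hypothetical mapping and Series (neither is from the original example):

```python
import pandas as pd

mapping = {1: 'one', 2: 'two'}  # hypothetical small dictionary
s = pd.Series([1, 2, 3])        # 3 has no entry in mapping

# Lambda with fallback: unmapped values are kept unchanged
kept = s.map(lambda x: mapping.get(x, x))

# dict.get without a default: unmapped values become missing
via_get = s.map(mapping.get)

# Passing the dict itself: unmapped values also become missing
via_dict = s.map(mapping)

print(kept.tolist())            # ['one', 'two', 3]
print(via_get.isna().tolist())  # [False, False, True]
print(via_dict.isna().tolist()) # [False, False, True]
```

Only the lambda with a fallback matches what replace does with values that are not in the dictionary, so it is the safer choice when full key coverage is not guaranteed.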
Timings
On the example data:
%timeit series.map(dictionary.get)
10000 loops, best of 3: 124 µs per loop
%timeit series.map(lambda x: dictionary.get(x,x))
10000 loops, best of 3: 150 µs per loop
%timeit series.map(dictionary)
100 loops, best of 3: 5.45 ms per loop
%timeit series.replace(dictionary)
1 loop, best of 3: 1.23 s per loop
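The %timeit lines above only work inside IPython. To repeat the comparison in a plain Python script, something like the following sketch using the standard timeit module could work. The candidates and timings names and the repeat counts are assumptions, and the numbers it prints will vary with the machine and the pandas version, so they should not be expected to match the figures above.

```python
import random
import timeit

import pandas as pd

# Same shape of data as in the question
dictionary = {x: 'Some string ' + str(x) for x in range(11270)}
series = pd.Series([random.randint(1, 11269) for _ in range(200)])

# Each candidate is a zero-argument callable so timeit can call it directly
candidates = {
    'map(dictionary.get)': lambda: series.map(dictionary.get),
    'map(lambda with fallback)': lambda: series.map(lambda x: dictionary.get(x, x)),
    'map(dictionary)': lambda: series.map(dictionary),
    'replace(dictionary)': lambda: series.replace(dictionary),
}

timings = {}
for label, func in candidates.items():
    # Best of 3 single calls; kept small because replace may take over a second
    timings[label] = min(timeit.repeat(func, number=1, repeat=3))
    print(f'{label}: {timings[label] * 1e3:.3f} ms per call')
```

A single call per repeat gives coarse numbers for the microsecond-scale variants; raising number for those would tighten the measurement at the cost of a longer run.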