This article covers how to deal with slow dictionary-based replacement in pandas.

Problem description

Please help me understand why this "replace from dictionary" operation is slow in Python/pandas:

# Series has 200 rows and 1 column
# Dictionary has 11269 key-value pairs
series.replace(dictionary, inplace=True)

Dictionary lookups should be O(1). Replacing a value in a column should be O(1). Isn't this a vectorized operation? Even if it is not vectorized, iterating over 200 rows is only 200 iterations, so how can it be slow?

Here is an SSCCE demonstrating the issue:

import pandas as pd
import random

# Initialize dummy data
dictionary = {}
orig = []
for x in range(1, 11270):  # keys 1..11269, matching the stated size and randint range
    dictionary[x] = 'Some string ' + str(x)
for x in range(200):
    orig.append(random.randint(1, 11269))
series = pd.Series(orig)

# The actual operation we care about
print('Starting...')
series.replace(dictionary, inplace=True)
print('Done.')

Running that command takes more than 1 second on my machine, which is thousands of times longer than I would expect for fewer than 1000 operations.

Recommended answer

It looks like replace has a fair amount of overhead here, and explicitly telling the Series what to do via map yields the best performance:

series = series.map(lambda x: dictionary.get(x,x))
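
To make the fallback behaviour concrete, here is a minimal sketch on a tiny, made-up mapping (the names mapping and s are hypothetical, not from the question). It suggests that the lambda form leaves unmatched values alone, which is what replace does as well:

```python
import pandas as pd

# Hypothetical small mapping; the key 3 is deliberately missing.
mapping = {1: "one", 2: "two"}
s = pd.Series([1, 2, 3])

# dict.get(x, x) returns the original value when the key is absent,
# so unmatched entries pass through unchanged.
mapped = s.map(lambda x: mapping.get(x, x))

# replace also leaves values without a matching key untouched.
replaced = s.replace(mapping)

print(mapped.tolist())
print(replaced.tolist())
```

On this small input the two results should contain the same values, so the map version can stand in for replace when only exact key matches are needed.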

If you are sure that all keys are in your dictionary, you can get a very slight performance boost by not creating a lambda and supplying the dictionary.get function directly. Beware that any key not present in the dictionary comes back as a missing value (None or NaN) with this method, instead of keeping its original value:

series = series.map(dictionary.get)

You can also supply just the dictionary itself, but this appears to introduce a bit of overhead:

series = series.map(dictionary)
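
The two shorter forms differ from the lambda version in how they treat keys that are missing from the dictionary. The sketch below uses a small hypothetical mapping (not the question's data) to show that both mark the unmatched row as missing rather than keeping the original value:

```python
import pandas as pd

# Hypothetical small mapping; the key 3 is deliberately missing.
mapping = {1: "one", 2: "two"}
s = pd.Series([1, 2, 3])

by_get = s.map(mapping.get)   # dict.get returns None for the missing key
by_dict = s.map(mapping)      # pandas fills the missing key with NaN

# isna() is True for both None and NaN, so it works for either form.
print(by_get.isna().tolist())
print(by_dict.isna().tolist())
```

If unmatched values must survive, the lambda with dictionary.get(x, x) is the safer choice of the three.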

Timings

Using the sample data from the question:

%timeit series.map(dictionary.get)
10000 loops, best of 3: 124 µs per loop

%timeit series.map(lambda x: dictionary.get(x,x))
10000 loops, best of 3: 150 µs per loop

%timeit series.map(dictionary)
100 loops, best of 3: 5.45 ms per loop

%timeit series.replace(dictionary)
1 loop, best of 3: 1.23 s per loop
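
The figures above come from IPython's %timeit. For readers without IPython, the comparison can be approximated with the standard timeit module. This is only a rough sketch: the candidates and timings names are made up for illustration, and absolute numbers will vary with the pandas version and the machine.

```python
import random
import timeit

import pandas as pd

# Same dummy data as in the question.
dictionary = {x: "Some string " + str(x) for x in range(11270)}
series = pd.Series([random.randint(1, 11269) for _ in range(200)])

# Hypothetical label -> callable table for the four variants.
candidates = {
    "map(dictionary.get)": lambda: series.map(dictionary.get),
    "map(lambda with fallback)": lambda: series.map(lambda x: dictionary.get(x, x)),
    "map(dictionary)": lambda: series.map(dictionary),
    "replace(dictionary)": lambda: series.replace(dictionary),
}

timings = {}
for label, func in candidates.items():
    # Best of 3 single runs, in seconds.
    timings[label] = min(timeit.repeat(func, number=1, repeat=3))
    print(f"{label}: {timings[label] * 1e3:.3f} ms")
```

A single run per repeat is used because the replace variant may take on the order of a second per call.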

