Problem Description
I have noticed very poor performance when using iterrows from pandas.
Is this something others have experienced? Is it specific to iterrows, and should this function be avoided for data of a certain size (I'm working with 2-3 million rows)?
This discussion on GitHub led me to believe it is caused by mixing dtypes in the dataframe; however, the simple example below shows the problem is present even when using a single dtype (float64). This takes 36 seconds on my machine:
import pandas as pd
import numpy as np
import time

s1 = np.random.randn(2000000)
s2 = np.random.randn(2000000)
dfa = pd.DataFrame({'s1': s1, 's2': s2})

start = time.time()
i = 0
for rowindex, row in dfa.iterrows():
    i += 1
end = time.time()
print(end - start)
Why are vectorized operations like apply so much quicker? I imagine there must be some row-by-row iteration going on there too.
I cannot figure out how to not use iterrows in my case (this I'll save for a future question). Therefore I would appreciate hearing if you have consistently been able to avoid this iteration. I'm making calculations based on data in separate dataframes. Thank you!
--- A simplified version of what I want to run has been added below ---
import pandas as pd
import numpy as np

#%% Create the original tables
t1 = {'letter': ['a', 'b'],
      'number1': [50, -10]}
t2 = {'letter': ['a', 'a', 'b', 'b'],
      'number2': [0.2, 0.5, 0.1, 0.4]}
table1 = pd.DataFrame(t1)
table2 = pd.DataFrame(t2)

#%% Define optimization
def optimize(t2info, t1info):
    calculation = []
    for index, r in t2info.iterrows():
        calculation.append(r['number2'] * t1info)
    maxrow = calculation.index(max(calculation))
    return t2info.loc[maxrow]

#%% Create the body of the new table
table3 = pd.DataFrame(np.nan, columns=['letter', 'number2'], index=table1.index)

#%% Iterate through filtering relevant data, optimizing, returning info
for row_index, row in table1.iterrows():
    t2info = table2[table2.letter == row['letter']].reset_index()
    table3.loc[row_index] = optimize(t2info, row['number1'])
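For reference, a per-group "pick the row with the largest product" computation like the one above can usually be expressed without iterrows at all, for example with a merge followed by a grouped idxmax. This is a sketch under the assumption that the goal is exactly the maximum of number2 * number1 within each letter, which may not cover the full real problem:

```python
import pandas as pd

table1 = pd.DataFrame({'letter': ['a', 'b'], 'number1': [50, -10]})
table2 = pd.DataFrame({'letter': ['a', 'a', 'b', 'b'],
                       'number2': [0.2, 0.5, 0.1, 0.4]})

# Join each table2 row to its matching number1, then keep the row whose
# product is largest within each letter group.
merged = table2.merge(table1, on='letter')
merged['product'] = merged['number2'] * merged['number1']
best = merged.loc[merged.groupby('letter')['product'].idxmax(),
                  ['letter', 'number2']].reset_index(drop=True)
```

This produces one row per letter (0.5 for 'a', 0.1 for 'b', matching the loop version) with no Python-level iteration.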
Recommended Answer
Generally, iterrows should only be used in very, very specific cases. This is the general order of precedence for performance of various operations:
1) vectorization
2) using a custom cython routine
3) apply
a) reductions that can be performed in cython
b) iteration in python space
4) itertuples
5) iterrows
6) updating an empty frame (e.g. using loc one-row-at-a-time)
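The gap between the top and bottom of that list can be illustrated on a frame like the OP's (smaller here to keep runtime reasonable; exact timings are machine-dependent, so treat this as a sketch):

```python
import time
import numpy as np
import pandas as pd

df = pd.DataFrame({'s1': np.random.randn(100_000),
                   's2': np.random.randn(100_000)})

def bench(fn):
    # Return the function's result together with its wall-clock time.
    start = time.time()
    result = fn()
    return result, time.time() - start

# 1) Vectorization: one NumPy-backed expression over whole columns.
vec, t_vec = bench(lambda: (df['s1'] + df['s2']).sum())

# 4) itertuples: plain tuples, no per-row Series construction.
tup, t_tup = bench(lambda: sum(r.s1 + r.s2 for r in df.itertuples()))

# 5) iterrows: every row is boxed into a Series first.
itr, t_itr = bench(lambda: sum(row['s1'] + row['s2']
                               for _, row in df.iterrows()))
```

All three compute the same total; the vectorized version is typically orders of magnitude faster than iterrows.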
Using a custom Cython routine is usually too complicated, so let's skip that for now.
1) Vectorization is ALWAYS, ALWAYS the first and best choice. However, there is a small set of cases (usually involving a recurrence) which cannot be vectorized in obvious ways. Furthermore, on a smallish DataFrame, it may be faster to use other methods.
3) apply usually can be handled by an iterator in Cython space. This is handled internally by pandas, though it depends on what is going on inside the apply expression. For example, df.apply(lambda x: np.sum(x)) will be executed pretty swiftly, though of course, df.sum(1) is even better. However, something like df.apply(lambda x: x['b'] + 1) will be executed in Python space, and consequently is much slower.
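To make that distinction concrete, here is a small sketch (the column names are illustrative, not from the OP's data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [10.0, 20.0, 30.0]})

# A reduction pandas can recognize and run through its fast paths:
fast = df.apply(np.sum)

# The direct method is better still and gives the same column sums:
better = df.sum()

# A row-wise lambda that indexes into each row runs in Python space:
slow = df.apply(lambda x: x['b'] + 1, axis=1)
```

The first two produce identical column sums; the third builds a Series per row just to add 1, which is where the slowdown comes from.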
4) itertuples does not box the data into a Series. It just returns the data in the form of tuples.
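A quick sketch of what itertuples actually yields:

```python
import pandas as pd

df = pd.DataFrame({'s1': [1.0, 2.0], 's2': [3.0, 4.0]})

# Each yielded element is a plain namedtuple (Index, s1, s2), not a Series:
first = next(df.itertuples())
total = first.s1 + first.s2
```

Because no Series object is constructed per row, iterating this way is much cheaper than iterrows.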
5) iterrows DOES box the data into a Series. Unless you really need this, use another method.
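The boxing is easy to see directly (a minimal sketch):

```python
import pandas as pd

df = pd.DataFrame({'s1': [1.0, 2.0], 's2': [3.0, 4.0]})

# Each iteration constructs a brand-new Series object for the row:
index, row = next(df.iterrows())
```

That per-row Series construction is the overhead the answer is warning about.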
6) Updating an empty frame a-single-row-at-a-time. I have seen this method used WAY too much. It is by far the slowest. It is probably commonplace (and reasonably fast for some Python structures), but a DataFrame does a fair number of checks on indexing, so updating a row at a time will always be very slow. Much better to create new structures and concat.
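The "build pieces, then concat once" pattern looks like this (a sketch with made-up columns):

```python
import pandas as pd

# Accumulate pieces in a plain Python list, then concatenate once at the
# end, instead of growing a DataFrame one row at a time with loc:
pieces = [pd.DataFrame({'x': [i], 'y': [i * i]}) for i in range(5)]
result = pd.concat(pieces, ignore_index=True)
```

Appending to a list is O(1) per row, while every row-at-a-time DataFrame update re-runs indexing checks, so the single concat at the end wins by a wide margin as the row count grows.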