Question

Are for loops really "bad"? If not, in what situation(s) would they be better than using a more conventional "vectorized" approach?
I am familiar with the concept of "vectorization", and how pandas employs vectorized techniques to speed up computation. Vectorized functions broadcast operations over the entire series or DataFrame to achieve speedups much greater than conventionally iterating over the data.
However, I am quite surprised to see a lot of code (including from answers on Stack Overflow) offering solutions to problems that involve looping through data using for loops and list comprehensions. The documentation and API say that loops are "bad", and that one should "never" iterate over arrays, series, or DataFrames. So, how come I sometimes see users suggesting loop-based solutions?
Answer

TLDR; No, for loops are not blanket "bad", at least, not always. It is probably more accurate to say that some vectorized operations are slower than iterating, versus saying that iteration is faster than some vectorized operations. Knowing when and why is key to getting the most performance out of your code. In a nutshell, these are the situations where it is worth considering an alternative to vectorized pandas functions:
- When your data is small (...depending on what you're doing),
- When dealing with object/mixed dtypes
- When using the str/regex accessor functions
Let's examine these situations individually.
Pandas follows a "Convention Over Configuration" approach in its API design. This means that the same API has been fitted to cater to a broad range of data and use cases.
When a pandas function is called, the following things (among others) must internally be handled by the function, to ensure it works correctly:

- Index/axis alignment
- Handling mixed datatypes
- Handling missing data
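As a quick illustration of the first point, a minimal sketch (invented data) of the label-based alignment that a binary operation silently performs:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=[0, 1, 2])
s2 = pd.Series([10, 20, 30], index=[1, 2, 3])

# Values are matched by index label, not position;
# non-overlapping labels produce NaN.
total = s1 + s2  # index [0, 1, 2, 3]; values [NaN, 12.0, 23.0, NaN]
```

This bookkeeping happens on every call, whether or not your indexes actually need aligning.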
Almost every function will have to deal with these to varying extents, and this presents an overhead. The overhead is less for numeric functions (for example, Series.add), while it is more pronounced for string functions (for example, Series.str.replace).
for loops, on the other hand, are faster than you think. What's even better is that list comprehensions (which create lists through for loops) are even faster, as they are optimized iterative mechanisms for list creation.
List comprehensions follow the pattern

[f(x) for x in seq]

Where seq is a pandas Series or DataFrame column. Or, when operating over multiple columns,

[f(x, y) for x, y in zip(seq1, seq2)]

Where seq1 and seq2 are columns.
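To make both forms concrete, a tiny invented example:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [3, 2, 1]})

# Single column:
squares = [x ** 2 for x in df['A']]               # [1, 4, 9]
# Multiple columns, zipped together:
sums = [x + y for x, y in zip(df['A'], df['B'])]  # [4, 4, 4]
```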
Numeric Comparison

Consider a simple boolean indexing operation. The list comprehension method has been timed against Series.ne (!=) and query. Here are the functions:
# Boolean indexing with Numeric value comparison.
df[df.A != df.B] # vectorized !=
df.query('A != B') # query (numexpr)
df[[x != y for x, y in zip(df.A, df.B)]] # list comp
For simplicity, I have used the perfplot package to run all the timeit tests in this post. The timings for the operations above are below:
The list comprehension outperforms query for moderately sized N, and even outperforms the vectorized not equals comparison for tiny N. Unfortunately, the list comprehension scales linearly, so it does not offer much performance gain for larger N.
df[df.A.values != df.B.values]

Which outperforms both the pandas and list comprehension equivalents.

NumPy vectorization is out of the scope of this post, but it is definitely worth considering, if performance matters.
Value Counts

Taking another example - this time, with another vanilla python construct that is faster than a for loop - collections.Counter. A common requirement is to compute the value counts and return the result as a dictionary. This is done with value_counts, np.unique, and Counter:
# Value Counts comparison.
ser.value_counts(sort=False).to_dict() # value_counts
dict(zip(*np.unique(ser, return_counts=True))) # np.unique
Counter(ser) # Counter
The results are more pronounced: Counter wins out over both vectorized methods for a larger range of small N (~3500).
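As a sanity check (and the reason the appendix passes a custom equality_check), all three approaches produce the same mapping on toy data:

```python
from collections import Counter

import numpy as np
import pandas as pd

ser = pd.Series([1, 2, 2, 3, 3, 3])

vc = ser.value_counts(sort=False).to_dict()          # pandas
uq = dict(zip(*np.unique(ser, return_counts=True)))  # numpy
ct = dict(Counter(ser))                              # vanilla python

assert vc == uq == ct == {1: 1, 2: 2, 3: 3}
```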
Of course, the takeaway from here is that the performance depends on your data and use case. The point of these examples is to convince you not to rule out these solutions as legitimate options. If these still don't give you the performance you need, there is always cython and numba. Let's add this test into the mix.
from numba import njit, prange

@njit(parallel=True)
def get_mask(x, y):
    result = [False] * len(x)
    for i in prange(len(x)):
        result[i] = x[i] != y[i]
    return np.array(result)

df[get_mask(df.A.values, df.B.values)] # numba
Numba offers JIT compilation of loopy python code to very powerful vectorized code. Understanding how to make numba work involves a learning curve.
String-based Comparison

Revisiting the filtering example from the first section, what if the columns being compared are strings? Consider the same 3 functions above, but with the input DataFrame cast to string.
# Boolean indexing with string value comparison.
df[df.A != df.B] # vectorized !=
df.query('A != B') # query (numexpr)
df[[x != y for x, y in zip(df.A, df.B)]] # list comp
So, what changed? The thing to note here is that string operations are inherently difficult to vectorize. Pandas treats strings as objects, and all operations on objects fall back to a slow, loopy implementation.
Now, because this loopy implementation is surrounded by all the overhead mentioned above, there is a constant magnitude difference between these solutions, even though they scale the same.
When it comes to operations on mutable/complex objects, there is no comparison. List comprehension outperforms all operations involving dicts and lists.
Accessing Dictionary Value(s) by Key

Here are timings for two operations that extract a value from a column of dictionaries: map and the list comprehension. The setup is in the Appendix, under the heading "Code Snippets".
# Dictionary value extraction.
ser.map(operator.itemgetter('value')) # map
pd.Series([x.get('value') for x in ser]) # list comprehension
Positional List Indexing

Timings for operations that extract the 0th element from a column of lists (handling exceptions): map, the str accessor method, and list comprehensions:
# List positional indexing.
def get_0th(lst):
    try:
        return lst[0]
    # Handle empty lists and NaNs gracefully.
    except (IndexError, TypeError):
        return np.nan

ser.map(get_0th) # map
ser.str[0] # str accessor
pd.Series([x[0] if len(x) > 0 else np.nan for x in ser]) # list comp
pd.Series([get_0th(x) for x in ser]) # list comp safe
Note: use

pd.Series([...], index=ser.index)

when reconstructing the series, so the original index is preserved.
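A small sketch of why that matters (example series invented): rebuilding with a plain list comp resets the index to 0..n-1 unless you pass the original one back.

```python
import numpy as np
import pandas as pd

ser = pd.Series([['a'], [], ['c']], index=[10, 20, 30])

# Without the index, the new series is renumbered from 0.
reset = pd.Series([x[0] if len(x) > 0 else np.nan for x in ser])
# Passing ser.index keeps the original labels aligned.
kept = pd.Series([x[0] if len(x) > 0 else np.nan for x in ser], index=ser.index)

print(reset.index.tolist())  # [0, 1, 2]
print(kept.index.tolist())   # [10, 20, 30]
```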
List Flattening

A final example is flattening lists. This is another common problem, and demonstrates just how powerful pure python is here.
# Nested list flattening.
pd.DataFrame(ser.tolist()).stack().reset_index(drop=True) # stack
pd.Series(list(chain.from_iterable(ser.tolist()))) # itertools.chain
pd.Series([y for x in ser for y in x]) # nested list comp
Both itertools.chain.from_iterable and the nested list comprehension are pure python constructs, and scale much better than the stack solution.
These timings are a strong indication of the fact that pandas is not equipped to work with mixed dtypes, and that you should probably refrain from using it to do so. Wherever possible, data should be present as scalar values (ints/floats/strings) in separate columns.
Lastly, the applicability of these solutions depends widely on your data. So, the best thing to do would be to test these operations on your data before deciding what to go with. Notice how I have not timed apply on these solutions, because it would skew the graph (yes, it's that slow).
Pandas can apply regex operations such as str.contains, str.extract, and str.extractall, as well as other "vectorized" string operations (such as str.split, str.find, str.translate, and so on) on string columns. These functions are slower than list comprehensions, and are meant to be more convenience functions than anything else.
It is usually much faster to pre-compile a regex pattern with re.compile and iterate over your data (also see Is it worth using Python's re.compile?). The list comp equivalent to str.contains looks something like this:
p = re.compile(...)
ser2 = pd.Series([x for x in ser if p.search(x)])

Or,

ser2 = ser[[bool(p.search(x)) for x in ser]]
If you need to handle NaNs, you can do something like
ser[[bool(p.search(x)) if pd.notnull(x) else False for x in ser]]
The list comp equivalent to str.extract (without groups) will look something like:
df['col2'] = [p.search(x).group(0) for x in df['col']]
If you need to handle no-matches and NaNs, you can use a custom function (still faster!):
def matcher(x):
    m = p.search(str(x))
    if m:
        return m.group(0)
    return np.nan

df['col2'] = [matcher(x) for x in df['col']]
The matcher function is very extensible. It can be fitted to return a list for each capture group, as needed. Just query the group or groups attribute of the match object.
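For example, a hypothetical variant (the pattern and data here are made up) that returns every capture group:

```python
import re

import numpy as np

p = re.compile(r'(\w+)=(\d+)')

def matcher_groups(x):
    # Return all capture groups as a tuple, or NaN on no match.
    m = p.search(str(x))
    return m.groups() if m else np.nan

out = [matcher_groups(s) for s in ['a=1', 'no match', 'b=22']]
# [('a', '1'), nan, ('b', '22')]
```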
For str.extractall, change p.search to p.findall.
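A sketch of that swap on invented data: p.findall returns every match per row, mirroring what str.extractall collects.

```python
import re

import pandas as pd

p = re.compile(r'\d+')
ser = pd.Series(['a1 b22', 'nothing', 'c333'])

# One list of matches per row; empty list where nothing matches.
matches = [p.findall(x) for x in ser]
# [['1', '22'], [], ['333']]
```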
String Extraction

Consider a simple filtering operation. The idea is to extract 4 digits if preceded by an upper case letter.
# Extracting strings.
p = re.compile(r'(?<=[A-Z])(\d{4})')

def matcher(x):
    m = p.search(x)
    if m:
        return m.group(0)
    return np.nan

ser.str.extract(r'(?<=[A-Z])(\d{4})', expand=False) # str.extract
pd.Series([matcher(x) for x in ser]) # list comprehension
More Examples

Full disclosure - I am the author (in part or whole) of the posts listed below.
As shown in the examples above, iteration shines when working with small DataFrames, mixed datatypes, and regular expressions.
The speedup you get depends on your data and your problem, so your mileage may vary. The best thing to do is to carefully run tests and see if the payout is worth the effort.
The "vectorized" functions shine in their simplicity and readability, so if performance is not critical, you should definitely prefer those.
As another side note, certain string operations deal with constraints that favour the use of NumPy. Here are two examples where careful NumPy vectorization outperforms python:
Additionally, sometimes just operating on the underlying arrays via .values, as opposed to on the Series or DataFrame, can offer a healthy enough speedup for most usual scenarios (see the note in the Numeric Comparison section above). So, for example, df[df.A.values != df.B.values] would show instant performance boosts over df[df.A != df.B]. Using .values may not be appropriate in every situation, but it is a useful hack to know.
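To check that the hack is behavior-preserving on a simple toy frame, both forms produce the same mask:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [4, 2, 3, 1]})

mask_values = df.A.values != df.B.values  # raw ndarray comparison, skips alignment
mask_pandas = (df.A != df.B).values       # regular pandas comparison

assert (mask_values == mask_pandas).all()
print(df[mask_values].shape)  # (2, 2)
```

The .values version skips index alignment and Series construction, which is where the savings come from; it is only safe when both columns share the same length and ordering, as they do within a single DataFrame.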
As mentioned above, it's up to you to decide whether these solutions are worth the trouble of implementing.
Code Snippets

import perfplot
import operator
import pandas as pd
import numpy as np
import re
from collections import Counter
from itertools import chain
# Boolean indexing with Numeric value comparison.
perfplot.show(
    setup=lambda n: pd.DataFrame(np.random.choice(1000, (n, 2)), columns=['A','B']),
    kernels=[
        lambda df: df[df.A != df.B],
        lambda df: df.query('A != B'),
        lambda df: df[[x != y for x, y in zip(df.A, df.B)]],
        lambda df: df[get_mask(df.A.values, df.B.values)]
    ],
    labels=['vectorized !=', 'query (numexpr)', 'list comp', 'numba'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N'
)
# Value Counts comparison.
perfplot.show(
    setup=lambda n: pd.Series(np.random.choice(1000, n)),
    kernels=[
        lambda ser: ser.value_counts(sort=False).to_dict(),
        lambda ser: dict(zip(*np.unique(ser, return_counts=True))),
        lambda ser: Counter(ser),
    ],
    labels=['value_counts', 'np.unique', 'Counter'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N',
    equality_check=lambda x, y: dict(x) == dict(y)
)
# Boolean indexing with string value comparison.
perfplot.show(
    setup=lambda n: pd.DataFrame(np.random.choice(1000, (n, 2)), columns=['A','B'], dtype=str),
    kernels=[
        lambda df: df[df.A != df.B],
        lambda df: df.query('A != B'),
        lambda df: df[[x != y for x, y in zip(df.A, df.B)]],
    ],
    labels=['vectorized !=', 'query (numexpr)', 'list comp'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N',
    equality_check=None
)
# Dictionary value extraction.
ser1 = pd.Series([{'key': 'abc', 'value': 123}, {'key': 'xyz', 'value': 456}])
perfplot.show(
    setup=lambda n: pd.concat([ser1] * n, ignore_index=True),
    kernels=[
        lambda ser: ser.map(operator.itemgetter('value')),
        lambda ser: pd.Series([x.get('value') for x in ser]),
    ],
    labels=['map', 'list comprehension'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N',
    equality_check=None
)
# List positional indexing.
ser2 = pd.Series([['a', 'b', 'c'], [1, 2], []])
perfplot.show(
    setup=lambda n: pd.concat([ser2] * n, ignore_index=True),
    kernels=[
        lambda ser: ser.map(get_0th),
        lambda ser: ser.str[0],
        lambda ser: pd.Series([x[0] if len(x) > 0 else np.nan for x in ser]),
        lambda ser: pd.Series([get_0th(x) for x in ser]),
    ],
    labels=['map', 'str accessor', 'list comprehension', 'list comp safe'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N',
    equality_check=None
)
# Nested list flattening.
ser3 = pd.Series([['a', 'b', 'c'], ['d', 'e'], ['f', 'g']])
perfplot.show(
    setup=lambda n: pd.concat([ser3] * n, ignore_index=True),
    kernels=[
        lambda ser: pd.DataFrame(ser.tolist()).stack().reset_index(drop=True),
        lambda ser: pd.Series(list(chain.from_iterable(ser.tolist()))),
        lambda ser: pd.Series([y for x in ser for y in x]),
    ],
    labels=['stack', 'itertools.chain', 'nested list comp'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N',
    equality_check=None
)
# Extracting strings.
ser4 = pd.Series(['foo xyz', 'test A1234', 'D3345 xtz'])
perfplot.show(
    setup=lambda n: pd.concat([ser4] * n, ignore_index=True),
    kernels=[
        lambda ser: ser.str.extract(r'(?<=[A-Z])(\d{4})', expand=False),
        lambda ser: pd.Series([matcher(x) for x in ser])
    ],
    labels=['str.extract', 'list comprehension'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N',
    equality_check=None
)