python - Why is applying a function to a DataFrame (using "apply") so much faster than applying it to a Series?

Why is applying a function to a DataFrame so much faster than applying it to a Series?

import time
import pandas as pd
import numpy as np

my_range = np.arange(0, 1_000_000, 1)
df = pd.DataFrame(my_range, columns=["a"])
my_time = time.time()
df["b"] = df.a.apply(lambda x: x ** 2)
print(time.time() - my_time)
# 7.199899435043335


my_range = np.arange(0, 1_000_000, 1)
df = pd.DataFrame(my_range, columns=["a"])
my_time = time.time()
df["b"] = df.apply(lambda x: x ** 2)
print(time.time() - my_time)
# 0.09276103973388672

Best answer

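The time difference comes from how the function gets called: apply on a Series calls the function once for every value in the Series, whereas apply on a DataFrame calls it only once per column, passing the whole column in as a Series.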

>>> my_range = np.arange(0, 10, 1, )
>>> df = pd.DataFrame(my_range, columns=["a"])
>>> _ = df.a.apply(lambda x: print(x, type(x)) or x ** 2)
0 <class 'int'>
1 <class 'int'>
2 <class 'int'>
3 <class 'int'>
4 <class 'int'>
5 <class 'int'>
6 <class 'int'>
7 <class 'int'>
8 <class 'int'>
9 <class 'int'>

>>> _ = df.apply(lambda x: print(x, type(x)) or x ** 2)
0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
Name: a, dtype: int32 <class 'pandas.core.series.Series'>
[... repeated one more time ...]

I'll ignore the second call discussed here (according to DYZ, it is pandas' way of checking whether it can take a fast path).

So in your case you have 2 calls (DataFrame) versus 1_000_000 calls (Series). That alone explains most of the time difference.
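If you want to check the call counts yourself, here is a minimal sketch that wraps the squaring function in a counter. The names `calls` and `square_counting` are made up for this illustration, and the frame is kept small so it runs instantly. Depending on your pandas version, the DataFrame count may come out as 1 rather than 2, since newer releases no longer make the extra probing call.

```python
import numpy as np
import pandas as pd

n = 1_000
df = pd.DataFrame(np.arange(n), columns=["a"])

# Hypothetical helper: counts how often pandas invokes the function.
calls = {"series": 0, "frame": 0}

def square_counting(key):
    def square(x):
        calls[key] += 1  # one increment per invocation by pandas
        return x ** 2
    return square

# Series.apply: x is a single scalar on each call.
series_result = df["a"].apply(square_counting("series"))

# DataFrame.apply: x is an entire column (a Series) on each call.
frame_result = df.apply(square_counting("frame"))

print(calls)
```

The Series count grows with the number of rows, while the DataFrame count only depends on the number of columns.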

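Given how differently the two work, they are not really comparable. If you apply the operation to the whole Series at once, it is a different story (and much faster):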

import pandas as pd
import numpy as np

my_range = np.arange(0, 1_000_000, 1, )
df = pd.DataFrame(my_range, columns=["a"])
%timeit df.a.apply(lambda x: x ** 2)
# 765 ms ± 4.28 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit df.apply(lambda x: x ** 2)
# 63.2 ms ± 1.48 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit df.a ** 2  # apply function on the whole series directly
# 10.9 ms ± 481 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
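
The `%timeit` lines above only work in IPython or Jupyter. As a rough sketch, the same comparison can be run in a plain Python script with the standard `timeit` module. The `best_time` helper and the smaller array size are my own choices for illustration, and the absolute numbers will differ from the ones quoted above depending on machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Smaller than the original 1_000_000 rows so the script finishes quickly.
df = pd.DataFrame(np.arange(0, 200_000, 1), columns=["a"])

def best_time(func, repeat=3):
    # Hypothetical helper: best-of-N wall time for a single call, in seconds.
    return min(timeit.repeat(func, number=1, repeat=repeat))

timings = {
    # Element-wise: one Python-level call per value.
    "series_apply": best_time(lambda: df.a.apply(lambda x: x ** 2)),
    # Column-wise: one call per column, vectorized inside.
    "frame_apply": best_time(lambda: df.apply(lambda x: x ** 2)),
    # No apply at all: operate on the whole Series directly.
    "vectorized": best_time(lambda: df.a ** 2),
}

for name, seconds in timings.items():
    print(f"{name}: {seconds * 1000:.2f} ms")
```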

Source: Stack Overflow, https://stackoverflow.com/questions/51806731/