Question
How do I measure the performance of the various functions below in a concise and comprehensive way?
Consider the DataFrame df:
df = pd.DataFrame({
    'Group': list('QLCKPXNLNTIXAWYMWACA'),
    'Value': [29, 52, 71, 51, 45, 76, 68, 60, 92, 95,
              99, 27, 77, 54, 39, 23, 84, 37, 99, 87]
})
I want to sum up the Value column grouped by distinct values in Group. I have three methods for doing it.
import pandas as pd
import numpy as np
from numba import njit

def sum_pd(df):
    return df.groupby('Group').Value.sum()

def sum_fc(df):
    f, u = pd.factorize(df.Group.values)
    v = df.Value.values
    return pd.Series(np.bincount(f, weights=v).astype(int), pd.Index(u, name='Group'), name='Value').sort_index()

@njit
def wbcnt(b, w, k):
    bins = np.arange(k)
    bins = bins * 0
    for i in range(len(b)):
        bins[b[i]] += w[i]
    return bins

def sum_nb(df):
    b, u = pd.factorize(df.Group.values)
    w = df.Value.values
    bins = wbcnt(b, w, u.size)
    return pd.Series(bins, pd.Index(u, name='Group'), name='Value').sort_index()
Are they the same?
print(sum_pd(df).equals(sum_nb(df)))
print(sum_pd(df).equals(sum_fc(df)))
True
True
How fast are they?
%timeit sum_pd(df)
%timeit sum_fc(df)
%timeit sum_nb(df)
1000 loops, best of 3: 536 µs per loop
1000 loops, best of 3: 324 µs per loop
1000 loops, best of 3: 300 µs per loop
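%timeit is an IPython magic, so it is not available in a plain Python script. A minimal sketch of a rough equivalent using the standard library timeit module is shown below. The helper name best_of is hypothetical, and a small list with the built-in sum stands in for the DataFrame functions so that the example runs without pandas or numba.

```python
import timeit

def best_of(func, *args, repeat=3, number=1000):
    """Return the best average time per call in seconds.

    Runs `number` calls per trial, `repeat` trials, and keeps the fastest
    trial, which is roughly what a "best of 3" report describes.
    """
    timer = timeit.Timer(lambda: func(*args))
    return min(timer.repeat(repeat=repeat, number=number)) / number

# Stand-in workload (assumption): replace `sum, data` with e.g. `sum_pd, df`.
data = list(range(20))
print(f"{best_of(sum, data) * 1e6:.3f} µs per loop")
```

Taking the minimum over the trials reduces the influence of unrelated background activity on the reported time.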
Recommended answer
They might not qualify as "simple frameworks" because they are third-party modules that need to be installed, but there are two frameworks I often use:
- simple_benchmark (I'm the author of this package)
- perfplot
For example, the simple_benchmark library lets you decorate the functions you want to benchmark:
from simple_benchmark import BenchmarkBuilder
b = BenchmarkBuilder()

import pandas as pd
import numpy as np
from numba import njit

@b.add_function()
def sum_pd(df):
    return df.groupby('Group').Value.sum()

@b.add_function()
def sum_fc(df):
    f, u = pd.factorize(df.Group.values)
    v = df.Value.values
    return pd.Series(np.bincount(f, weights=v).astype(int), pd.Index(u, name='Group'), name='Value').sort_index()

@njit
def wbcnt(b, w, k):
    bins = np.arange(k)
    bins = bins * 0
    for i in range(len(b)):
        bins[b[i]] += w[i]
    return bins

@b.add_function()
def sum_nb(df):
    b, u = pd.factorize(df.Group.values)
    w = df.Value.values
    bins = wbcnt(b, w, u.size)
    return pd.Series(bins, pd.Index(u, name='Group'), name='Value').sort_index()
Also decorate a function that produces the values for the benchmark:
from string import ascii_uppercase

def creator(n):  # taken from another answer here
    letters = list(ascii_uppercase)
    np.random.seed([3,1415])
    df = pd.DataFrame(dict(
        Group=np.random.choice(letters, n),
        Value=np.random.randint(100, size=n)
    ))
    return df

@b.add_arguments('Rows in DataFrame')
def argument_provider():
    for exponent in range(4, 22):
        size = 2**exponent
        yield size, creator(size)
Then all you need to do to run the benchmark is:
r = b.run()
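Conceptually, a run like this times every registered function on every generated argument. A minimal hand-rolled sketch of that idea, using only the standard library timeit module, might look like the following. It is not how simple_benchmark is implemented; run_manual_benchmark, builtin_sum, and loop_sum are hypothetical names, and toy list-summing functions stand in for sum_pd, sum_fc, and sum_nb so the example runs without pandas or numba.

```python
import timeit

def run_manual_benchmark(funcs, arguments, repeat=3, number=10):
    """Time each function on each argument.

    Returns a nested dict: {size: {function_name: seconds_per_call}}.
    """
    results = {}
    for size, arg in arguments.items():
        results[size] = {}
        for func in funcs:
            timer = timeit.Timer(lambda: func(arg))
            # Keep the fastest trial and convert it to seconds per call.
            best = min(timer.repeat(repeat=repeat, number=number)) / number
            results[size][func.__name__] = best
    return results

# Toy stand-ins (assumption) for the functions under test.
def builtin_sum(values):
    return sum(values)

def loop_sum(values):
    total = 0
    for v in values:
        total += v
    return total

# Input sizes grow as powers of two, as in the argument provider.
timings = run_manual_benchmark(
    [builtin_sum, loop_sum],
    {2**exponent: list(range(2**exponent)) for exponent in range(4, 10)},
)
for size, row in timings.items():
    print(size, row)
```

A benchmarking library adds conveniences on top of such a loop, for example plotting and export of the collected timings.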
After that you can inspect the results as a plot (you need the matplotlib library for this):
r.plot()
If the functions have very similar run times, the percentage difference can be more informative than the absolute numbers:
r.plot_difference_percentage(relative_to=sum_nb)
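To make the idea of a percentage difference concrete, here is a small sketch that computes it by hand for one row of timings. The formula (t - reference) / reference * 100 is an assumption about what "percentage difference" means here, not a statement about the library's internals, and percentage_difference is a hypothetical helper. The numbers are the 2097152-row timings reported in this answer.

```python
def percentage_difference(timings, relative_to):
    """Map each function name to its run-time difference from the reference, in percent."""
    reference = timings[relative_to]
    return {
        name: (seconds - reference) / reference * 100
        for name, seconds in timings.items()
    }

# Timings in seconds for 2097152 rows, copied from the results in this answer.
row = {"sum_pd": 0.112333, "sum_fc": 0.080407, "sum_nb": 0.072154}

for name, pct in percentage_difference(row, "sum_nb").items():
    print(f"{name}: {pct:+.1f}%")
```

A positive value means the function is slower than the reference; the reference itself is always at 0%.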
Or get the benchmark timings as a DataFrame (this requires pandas):
r.to_pandas_dataframe()
           sum_pd    sum_fc    sum_nb
16       0.000796  0.000515  0.000502
32       0.000702  0.000453  0.000454
64       0.000702  0.000454  0.000456
128      0.000711  0.000456  0.000458
256      0.000714  0.000461  0.000462
512      0.000728  0.000471  0.000473
1024     0.000746  0.000512  0.000513
2048     0.000825  0.000515  0.000514
4096     0.000902  0.000609  0.000640
8192     0.001056  0.000731  0.000755
16384    0.001381  0.001012  0.000936
32768    0.001885  0.001465  0.001328
65536    0.003404  0.002957  0.002585
131072   0.008076  0.005668  0.005159
262144   0.015532  0.011059  0.010988
524288   0.032517  0.023336  0.018608
1048576  0.055144  0.040367  0.035487
2097152  0.112333  0.080407  0.072154
If you don't like the decorators, you can also set everything up in one call (in that case you don't need the BenchmarkBuilder and the add_function/add_arguments decorators):
from simple_benchmark import benchmark
r = benchmark([sum_pd, sum_fc, sum_nb], {2**i: creator(2**i) for i in range(4, 22)}, "Rows in DataFrame")
perfplot offers a very similar interface (and result):
import perfplot
r = perfplot.bench(
    setup=creator,
    kernels=[sum_pd, sum_fc, sum_nb],
    n_range=[2**k for k in range(4, 22)],
    xlabel='Rows in DataFrame',
)
import matplotlib.pyplot as plt
plt.loglog()
r.plot()