问题描述
给出两个简单的DataFrames;
Given two simple DataFrames;
left = pd.DataFrame({'col1' : ['A', 'B', 'C'], 'col2' : [1, 2, 3]})
right = pd.DataFrame({'col1' : ['X', 'Y', 'Z'], 'col2' : [20, 30, 50]})
left
col1 col2
0 A 1
1 B 2
2 C 3
right
col1 col2
0 X 20
1 Y 30
2 Z 50
这些框架的叉积可以计算出来,如下所示:
The cross product of these frames can be computed, and will look something like:
A 1 X 20
A 1 Y 30
A 1 Z 50
B 2 X 20
B 2 Y 30
B 2 Z 50
C 3 X 20
C 3 Y 30
C 3 Z 50
计算此结果的最有效方法是什么?
What is the most performant method of computing this result?
推荐答案
让我们从建立基准开始.解决此问题的最简单方法是使用临时的键"列:
Let's start by establishing a benchmark. The easiest method for solving this is using a temporary "key" column:
def cartesian_product_basic(left, right):
return (
left.assign(key=1).merge(right.assign(key=1), on='key').drop('key', 1))
cartesian_product_basic(left, right)
col1_x col2_x col1_y col2_y
0 A 1 X 20
1 A 1 Y 30
2 A 1 Z 50
3 B 2 X 20
4 B 2 Y 30
5 B 2 Z 50
6 C 3 X 20
7 C 3 Y 30
8 C 3 Z 50
这是如何为两个DataFrame分配一个具有相同值(例如1)的临时键"列的. merge
然后对键"执行多对多JOIN.
How this works is that both DataFrames are assigned a temporary "key" column with the same value (say, 1). merge
then performs a many-to-many JOIN on "key".
尽管多对多JOIN技巧适用于大小合理的DataFrame,但您会在较大数据上看到相对较低的性能.
While the many-to-many JOIN trick works for reasonably sized DataFrames, you will see relatively lower performance on larger data.
更快的实现将需要NumPy.这是一些著名的一维笛卡尔积的NumPy实现.我们可以基于这些高性能解决方案中的一些来获得所需的输出.但是,我最喜欢的是@senderle的第一个实现.
A faster implementation will require NumPy. Here are some famous NumPy implementations of 1D cartesian product. We can build on some of these performant solutions to get our desired output. My favourite, however, is @senderle's first implementation.
def cartesian_product(*arrays):
la = len(arrays)
dtype = np.result_type(*arrays)
arr = np.empty([len(a) for a in arrays] + [la], dtype=dtype)
for i, a in enumerate(np.ix_(*arrays)):
arr[...,i] = a
return arr.reshape(-1, la)
概括:对唯一或非唯一索引数据帧进行交叉联接
Generalizing: CROSS JOIN on Unique or Non-Unique Indexed DataFrames
此技巧适用于任何类型的DataFrame.我们使用前述的cartesian_product
计算DataFrames的数字索引的笛卡尔积,使用它来重新索引DataFrames,然后
This trick will work on any kind of DataFrame. We compute the cartesian product of the DataFrames' numeric indices using the aforementioned cartesian_product
, use this to reindex the DataFrames, and
def cartesian_product_generalized(left, right):
la, lb = len(left), len(right)
idx = cartesian_product(np.ogrid[:la], np.ogrid[:lb])
return pd.DataFrame(
np.column_stack([left.values[idx[:,0]], right.values[idx[:,1]]]))
cartesian_product_generalized(left, right)
0 1 2 3
0 A 1 X 20
1 A 1 Y 30
2 A 1 Z 50
3 B 2 X 20
4 B 2 Y 30
5 B 2 Z 50
6 C 3 X 20
7 C 3 Y 30
8 C 3 Z 50
np.array_equal(cartesian_product_generalized(left, right),
cartesian_product_basic(left, right))
True
并且,类似地,
left2 = left.copy()
left2.index = ['s1', 's2', 's1']
right2 = right.copy()
right2.index = ['x', 'y', 'y']
left2
col1 col2
s1 A 1
s2 B 2
s1 C 3
right2
col1 col2
x X 20
y Y 30
y Z 50
np.array_equal(cartesian_product_generalized(left, right),
cartesian_product_basic(left2, right2))
True
此解决方案可以推广到多个DataFrame.例如,
This solution can generalise to multiple DataFrames. For example,
def cartesian_product_multi(*dfs):
idx = cartesian_product(*[np.ogrid[:len(df)] for df in dfs])
return pd.DataFrame(
np.column_stack([df.values[idx[:,i]] for i,df in enumerate(dfs)]))
cartesian_product_multi(*[left, right, left]).head()
0 1 2 3 4 5
0 A 1 X 20 A 1
1 A 1 X 20 B 2
2 A 1 X 20 C 3
3 A 1 X 20 D 4
4 A 1 Y 30 A 1
进一步简化
当处理仅两个数据帧时,可能会出现一种不涉及@senderle的cartesian_product
的更简单解决方案.使用np.broadcast_arrays
,我们可以达到几乎相同的性能水平.
Further Simplification
A simpler solution not involving @senderle's cartesian_product
is possible when dealing with just two DataFrames. Using np.broadcast_arrays
, we can achieve almost the same level of performance.
def cartesian_product_simplified(left, right):
la, lb = len(left), len(right)
ia2, ib2 = np.broadcast_arrays(*np.ogrid[:la,:lb])
return pd.DataFrame(
np.column_stack([left.values[ia2.ravel()], right.values[ib2.ravel()]]))
np.array_equal(cartesian_product_simplified(left, right),
cartesian_product_basic(left2, right2))
True
性能比较
在具有唯一索引的某些人为设计的DataFrame上对这些解决方案进行基准测试,我们拥有
Performance Comparison
Benchmarking these solutions on some contrived DataFrames with unique indices, we have
请注意,时间设置可能会根据您的设置,数据和cartesian_product
辅助功能的选择而有所不同.
Do note that timings may vary based on your setup, data, and choice of cartesian_product
helper function as applicable.
性能基准测试代码
这是时间脚本.上面定义了此处调用的所有功能.
Performance Benchmarking Code
This is the timing script. All functions called here are defined above.
from timeit import timeit
import pandas as pd
import matplotlib.pyplot as plt
res = pd.DataFrame(
index=['cartesian_product_basic', 'cartesian_product_generalized',
'cartesian_product_multi', 'cartesian_product_simplified'],
columns=[1, 10, 50, 100, 200, 300, 400, 500, 600, 800, 1000, 2000],
dtype=float
)
for f in res.index:
for c in res.columns:
# print(f,c)
left2 = pd.concat([left] * c, ignore_index=True)
right2 = pd.concat([right] * c, ignore_index=True)
stmt = '{}(left2, right2)'.format(f)
setp = 'from __main__ import left2, right2, {}'.format(f)
res.at[f, c] = timeit(stmt, setp, number=5)
ax = res.div(res.min()).T.plot(loglog=True)
ax.set_xlabel("N");
ax.set_ylabel("time (relative)");
plt.show()
这篇关于 pandas 高性能笛卡尔积(CROSS JOIN)的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!