pandas 高性能笛卡尔积(CROSS JOIN)

本文介绍了 pandas 高性能笛卡尔积(CROSS JOIN)的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

给出两个简单的DataFrames；

Given two simple DataFrames;

left = pd.DataFrame({'col1' : ['A', 'B', 'C'], 'col2' : [1, 2, 3]})
right = pd.DataFrame({'col1' : ['X', 'Y', 'Z'], 'col2' : [20, 30, 50]})

left

  col1  col2
0    A     1
1    B     2
2    C     3

right

  col1  col2
0    X    20
1    Y    30
2    Z    50

这些框架的叉积可以计算出来，如下所示:

The cross product of these frames can be computed, and will look something like:

A       1      X      20
A       1      Y      30
A       1      Z      50
B       2      X      20
B       2      Y      30
B       2      Z      50
C       3      X      20
C       3      Y      30
C       3      Z      50

计算此结果的最有效方法是什么?

What is the most performant method of computing this result?

推荐答案

让我们从建立基准开始.解决此问题的最简单方法是使用临时的键"列:

Let's start by establishing a benchmark. The easiest method for solving this is using a temporary "key" column:

def cartesian_product_basic(left, right):
    return (
       left.assign(key=1).merge(right.assign(key=1), on='key').drop('key', 1))

cartesian_product_basic(left, right)

  col1_x  col2_x col1_y  col2_y
0      A       1      X      20
1      A       1      Y      30
2      A       1      Z      50
3      B       2      X      20
4      B       2      Y      30
5      B       2      Z      50
6      C       3      X      20
7      C       3      Y      30
8      C       3      Z      50

这是如何为两个DataFrame分配一个具有相同值(例如1)的临时键"列的. merge然后对键"执行多对多JOIN.

How this works is that both DataFrames are assigned a temporary "key" column with the same value (say, 1). merge then performs a many-to-many JOIN on "key".

尽管多对多JOIN技巧适用于大小合理的DataFrame，但您会在较大数据上看到相对较低的性能.

While the many-to-many JOIN trick works for reasonably sized DataFrames, you will see relatively lower performance on larger data.

更快的实现将需要NumPy.这是一些著名的一维笛卡尔积的NumPy实现.我们可以基于这些高性能解决方案中的一些来获得所需的输出.但是，我最喜欢的是@senderle的第一个实现.

A faster implementation will require NumPy. Here are some famous NumPy implementations of 1D cartesian product. We can build on some of these performant solutions to get our desired output. My favourite, however, is @senderle's first implementation.

def cartesian_product(*arrays):
    la = len(arrays)
    dtype = np.result_type(*arrays)
    arr = np.empty([len(a) for a in arrays] + [la], dtype=dtype)
    for i, a in enumerate(np.ix_(*arrays)):
        arr[...,i] = a
    return arr.reshape(-1, la)

概括:对唯一或非唯一索引数据帧进行交叉联接

Generalizing: CROSS JOIN on Unique or Non-Unique Indexed DataFrames

此技巧适用于任何类型的DataFrame.我们使用前述的cartesian_product计算DataFrames的数字索引的笛卡尔积，使用它来重新索引DataFrames，然后

This trick will work on any kind of DataFrame. We compute the cartesian product of the DataFrames' numeric indices using the aforementioned cartesian_product, use this to reindex the DataFrames, and

def cartesian_product_generalized(left, right):
    la, lb = len(left), len(right)
    idx = cartesian_product(np.ogrid[:la], np.ogrid[:lb])
    return pd.DataFrame(
        np.column_stack([left.values[idx[:,0]], right.values[idx[:,1]]]))

cartesian_product_generalized(left, right)

   0  1  2   3
0  A  1  X  20
1  A  1  Y  30
2  A  1  Z  50
3  B  2  X  20
4  B  2  Y  30
5  B  2  Z  50
6  C  3  X  20
7  C  3  Y  30
8  C  3  Z  50

np.array_equal(cartesian_product_generalized(left, right),
               cartesian_product_basic(left, right))
True

并且，类似地，

left2 = left.copy()
left2.index = ['s1', 's2', 's1']

right2 = right.copy()
right2.index = ['x', 'y', 'y']


left2
   col1  col2
s1    A     1
s2    B     2
s1    C     3

right2
  col1  col2
x    X    20
y    Y    30
y    Z    50

np.array_equal(cartesian_product_generalized(left, right),
               cartesian_product_basic(left2, right2))
True

此解决方案可以推广到多个DataFrame.例如，

This solution can generalise to multiple DataFrames. For example,

def cartesian_product_multi(*dfs):
    idx = cartesian_product(*[np.ogrid[:len(df)] for df in dfs])
    return pd.DataFrame(
        np.column_stack([df.values[idx[:,i]] for i,df in enumerate(dfs)]))

cartesian_product_multi(*[left, right, left]).head()

   0  1  2   3  4  5
0  A  1  X  20  A  1
1  A  1  X  20  B  2
2  A  1  X  20  C  3
3  A  1  X  20  D  4
4  A  1  Y  30  A  1

进一步简化

当处理仅两个数据帧时，可能会出现一种不涉及@senderle的cartesian_product的更简单解决方案.使用np.broadcast_arrays，我们可以达到几乎相同的性能水平.

Further Simplification

A simpler solution not involving @senderle's cartesian_product is possible when dealing with just two DataFrames. Using np.broadcast_arrays, we can achieve almost the same level of performance.

def cartesian_product_simplified(left, right):
    la, lb = len(left), len(right)
    ia2, ib2 = np.broadcast_arrays(*np.ogrid[:la,:lb])

    return pd.DataFrame(
        np.column_stack([left.values[ia2.ravel()], right.values[ib2.ravel()]]))

np.array_equal(cartesian_product_simplified(left, right),
               cartesian_product_basic(left2, right2))
True

性能比较

在具有唯一索引的某些人为设计的DataFrame上对这些解决方案进行基准测试，我们拥有

Performance Comparison

Benchmarking these solutions on some contrived DataFrames with unique indices, we have

请注意，时间设置可能会根据您的设置，数据和cartesian_product辅助功能的选择而有所不同.

Do note that timings may vary based on your setup, data, and choice of cartesian_product helper function as applicable.

性能基准测试代码
这是时间脚本.上面定义了此处调用的所有功能.

Performance Benchmarking Code
This is the timing script. All functions called here are defined above.

from timeit import timeit
import pandas as pd
import matplotlib.pyplot as plt

res = pd.DataFrame(
       index=['cartesian_product_basic', 'cartesian_product_generalized', 
              'cartesian_product_multi', 'cartesian_product_simplified'],
       columns=[1, 10, 50, 100, 200, 300, 400, 500, 600, 800, 1000, 2000],
       dtype=float
)

for f in res.index: 
    for c in res.columns:
        # print(f,c)
        left2 = pd.concat([left] * c, ignore_index=True)
        right2 = pd.concat([right] * c, ignore_index=True)
        stmt = '{}(left2, right2)'.format(f)
        setp = 'from __main__ import left2, right2, {}'.format(f)
        res.at[f, c] = timeit(stmt, setp, number=5)

ax = res.div(res.min()).T.plot(loglog=True) 
ax.set_xlabel("N"); 
ax.set_ylabel("time (relative)");

plt.show()

这篇关于 pandas 高性能笛卡尔积(CROSS JOIN)的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持！