Pandas: Apply a function to each pair of columns

Problem description

I have a function f(x, y) that takes two pandas Series and returns a floating-point number. I would like to apply f to each pair of columns in a DataFrame D and construct another DataFrame E of the returned values, so that f(D[i], D[j]) is the value in the ith row and jth column. The straightforward solution is to run a nested loop over all pairs of columns:

E = pd.DataFrame([[f(D[i], D[j]) for i in D] for j in D],
                 columns=D.columns, index=D.columns)

But is there a more elegant solution that perhaps would not involve explicit loops?

NB This question is not a dupe of this, despite the similar names.

Edit: a toy example:

import pandas as pd

D = pd.DataFrame([[1,2,3], [4,5,6], [7,8,9]], columns=("a","b","c"))
def f(x,y): return x.dot(y)

E
#    a    b    c
#a  66   78   90
#b  78   93  108
#c  90  108  126
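
For reference, the pieces of the question can be assembled into one runnable snippet. This is only a sketch of the nested-loop baseline; the import and the final print are the only additions, and the column names are passed as a list rather than a tuple.

```python
import pandas as pd

# Toy data and function from the question.
D = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]], columns=["a", "b", "c"])

def f(x, y):
    return x.dot(y)

# Nested-loop baseline: the outer loop (j) produces the rows,
# the inner loop (i) produces the columns.
E = pd.DataFrame([[f(D[i], D[j]) for i in D] for j in D],
                 columns=D.columns, index=D.columns)
print(E)
```

Because the toy f is symmetric, the order of the two loops does not affect the values in this particular table.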

Recommended answer

You can use NumPy's broadcasting.

Combined with np.vectorize() and an explicit signature, that gives us the following:

import numpy as np

# '(n),(n)->()' tells np.vectorize that f maps two 1-D arrays to one scalar
vf = np.vectorize(f, signature='(n),(n)->()')
result = vf(D.T.values, D.T.values[:, None])

Notes:

You can add a print statement (e.g. print(f'x:\n{x}\ny:\n{y}\n')) to your function to convince yourself that it is doing the right thing.
Your function f() is symmetric. If it were not (e.g. def f(x, y): return np.linalg.norm(x - y**2)), it would matter which argument is extended with an extra dimension for broadcasting. With the expression above, you get the same result as your E. If you instead use result = vf(D.T.values[:, None], D.T.values), you get its transpose.
The result is a NumPy array, of course; if you want it back as a DataFrame, add:

df = pd.DataFrame(result, index=D.columns, columns=D.columns)
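
The steps above can be checked end to end with a small sketch. It uses the asymmetric example from the notes so that the orientation of the result is visible; the names g, pairwise, E_loop and E_vec are made up for this illustration. Note that np.vectorize hands the function plain NumPy arrays rather than Series, so the function must work on arrays.

```python
import numpy as np
import pandas as pd

D = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]], columns=["a", "b", "c"])

def g(x, y):
    # Asymmetric example from the notes; x and y are 1-D NumPy arrays.
    return np.linalg.norm(x - y**2)

def pairwise(func, frame):
    # Hypothetical helper: apply func to every pair of columns.
    vf = np.vectorize(func, signature="(n),(n)->()")
    cols = frame.T.values                  # shape (n_cols, n_rows)
    result = vf(cols, cols[:, None])       # shape (n_cols, n_cols)
    return pd.DataFrame(result, index=frame.columns, columns=frame.columns)

# Nested-loop reference, with the loops ordered as in the question.
E_loop = pd.DataFrame([[g(D[i].values, D[j].values) for i in D] for j in D],
                      columns=D.columns, index=D.columns)
E_vec = pairwise(g, D)

# Same orientation as the nested loop ...
print(np.allclose(E_vec.to_numpy(), E_loop.to_numpy()))
# ... and, since g is asymmetric, different from its transpose.
print(np.allclose(E_vec.to_numpy(), E_loop.to_numpy().T))
```

Keep in mind that np.vectorize is a convenience wrapper around a Python-level loop, so each pair still costs one Python function call.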

BTW, if f() is really the one from your toy example, as I'm sure you already know, you can directly write:

df = D.T.dot(D)
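
As a quick check, the matrix product reproduces the table from the question. The same idea can sometimes be carried over to other functions: the second half of this sketch rewrites the asymmetric example from the notes with plain broadcasting. The variable names are made up for this illustration, and the intermediate array has shape (n_cols, n_cols, n_rows), so this is only reasonable when that fits in memory.

```python
import numpy as np
import pandas as pd

D = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]], columns=["a", "b", "c"])

# Toy f(x, y) = x.dot(y): all pairwise dot products are one matrix product.
E_dot = D.T.dot(D)
print(E_dot)

# Asymmetric g(x, y) = norm(x - y**2), written with broadcasting only.
cols = D.T.to_numpy()                                  # shape (n_cols, n_rows)
diff = cols[None, :, :] - (cols ** 2)[:, None, :]      # diff[j, i] = D[i] - D[j]**2
E_norm = pd.DataFrame(np.linalg.norm(diff, axis=-1),
                      index=D.columns, columns=D.columns)
print(E_norm)
```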

Performance:

Performance-wise, the speed-up using broadcasting and vectorize is roughly 10x (stable over various matrix sizes). By contrast, D.T.dot(D) is more than 700x faster for size (100, 100), but critically it seems that the relative speedup gets even higher with larger sizes (up to 12,000x faster in my tests, for size (200, 1000) resulting in 1M loops). So, as usual, there is a strong incentive to try and find a way to implement your function f() using existing numpy function(s)!
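
A comparison along these lines can be reproduced with a small timing harness. This is a sketch only: the matrix size, the random data and the function names are assumptions chosen so that it runs quickly, and the numbers it prints will vary by machine and library version.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
D = pd.DataFrame(rng.random((50, 40)))    # assumed size: 50 rows, 40 columns

def f(x, y):
    return x.dot(y)

def nested_loop():
    return pd.DataFrame([[f(D[i], D[j]) for i in D] for j in D],
                        columns=D.columns, index=D.columns)

vf = np.vectorize(f, signature="(n),(n)->()")

def vectorized():
    cols = D.T.values
    return pd.DataFrame(vf(cols, cols[:, None]),
                        index=D.columns, columns=D.columns)

def matrix_product():
    return D.T.dot(D)

# Best of three single runs for each approach.
timings = {}
for func in (nested_loop, vectorized, matrix_product):
    timings[func.__name__] = min(timeit.repeat(func, number=1, repeat=3))
    print(f"{func.__name__:15s} {timings[func.__name__]:.6f} s")
```

Increasing the size passed to rng.random shows how the gap between the three approaches changes as the number of column pairs grows.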
