I have a DataFrame, df, of the following form:

A   B   C
5   10  15
20  25  30

I want to perform the following operation to get:
A_B   A_C  B_C
-0.33 -0.5 -0.2
-0.11 -0.2 -0.09

A_B, A_C and B_C are defined as:
A_B: (A-B)/(A+B)
A_C: (A-C)/(A+C)
B_C: (B-C)/(B+C)
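As a quick check of these definitions, here is a minimal sketch that evaluates the three ratios for the first row of the sample data. The helper name norm_diff is introduced here for illustration only and does not appear in the original post.

```python
def norm_diff(x, y):
    """Return (x - y) / (x + y); assumes x + y is non-zero."""
    return (x - y) / (x + y)


# First row of the sample data.
A, B, C = 5, 10, 15
row = {
    "A_B": norm_diff(A, B),
    "A_C": norm_diff(A, C),
    "B_C": norm_diff(B, C),
}
print({name: round(value, 2) for name, value in row.items()})
```

Rounded to two decimals, these values agree with the first row of the desired output.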

This is what I am using:
 colnames = df.columns.tolist()[:-1]  # note: [:-1] leaves out the last column
 list_name = []
 for i, c in enumerate(colnames):
     # pair column c with every column that comes after it
     for k in range(i + 1, len(colnames)):
         name = c + '_' + colnames[k]
         df[name] = (df[c] - df[colnames[k]]) / (df[c] + df[colnames[k]])
         list_name.append(name)

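The problem is that my actual DataFrame has shape 5*381, so the result for all the combinations (A_B, A_C and so on) has shape 5*72390, and this takes 60 minutes to run.
So I tried converting it to a NumPy array so that I could optimize it with Numba and compute it efficiently (Parallel programming approach to solve pandas problems), but I was unable to convert it to a NumPy array.
Any other approach to solving this problem is also welcome.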
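The figure of 72390 columns follows from counting unordered pairs of 381 columns. A short sketch of that arithmetic (the variable names are illustrative):

```python
from math import comb

n_columns = 381
# Number of unordered column pairs: n * (n - 1) / 2.
n_pairs = comb(n_columns, 2)
print(n_pairs)
```

One plausible reason for the long runtime, stated here as an assumption rather than something measured in the post, is that the loop inserts each of these columns into the DataFrame one at a time.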

Best answer

Use:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [5, 20],
    'B': [10, 25],
    'C': [15, 30],
})

print (df)
    A   B   C
0   5  10  15
1  20  25  30

First collect all combinations of two columns into two tuples (a holds the first value of each pair, b the second):
from  itertools import combinations

a, b = zip(*(combinations(df.columns, 2)))
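To make the zip(*...) step concrete, this small sketch prints the intermediate pairs and the two resulting tuples. A plain list of names stands in for df.columns.

```python
from itertools import combinations

columns = ["A", "B", "C"]  # stand-in for df.columns
pairs = list(combinations(columns, 2))
a, b = zip(*pairs)  # transpose the list of pairs into two tuples

print(pairs)  # [('A', 'B'), ('A', 'C'), ('B', 'C')]
print(a)      # ('A', 'A', 'B')
print(b)      # ('B', 'C', 'C')
```

Position i of a and position i of b together name the i-th pair, which is what lets the later steps operate on whole arrays at once.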

Then select the columns with DataFrame.loc using these tuples, which repeats columns as needed:
df1 = df.loc[:, a]
print (df1)
    A   A   B
0   5   5  10
1  20  20  25

df2 = df.loc[:, b]
print (df2)
    B   C   C
0  10  15  15
1  25  30  30

Convert the values to NumPy arrays for the final DataFrame and build the new column names with a list comprehension:
c = [f'{x}_{y}' for x, y in zip(a, b)]
arr1 = df1.values
arr2 = df2.values
df = pd.DataFrame((arr1-arr2)/(arr1+arr2), columns=c)
print (df)
        A_B  A_C       B_C
0 -0.333333 -0.5 -0.200000
1 -0.111111 -0.2 -0.090909
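The steps so far can be gathered into one reusable function. This is a sketch under a few assumptions: the name pairwise_norm_diff is hypothetical, the columns are cast to float, and the original index is kept (the answer's code resets it). If any pair of values sums to zero, NumPy will emit a warning and produce inf or nan for that cell.

```python
from itertools import combinations

import numpy as np
import pandas as pd


def pairwise_norm_diff(df):
    """Return (x - y) / (x + y) for every pair of columns of df."""
    a, b = zip(*combinations(df.columns, 2))
    # Selecting with a list of labels repeats columns where needed.
    arr1 = df.loc[:, list(a)].to_numpy(dtype=float)
    arr2 = df.loc[:, list(b)].to_numpy(dtype=float)
    names = [f"{x}_{y}" for x, y in zip(a, b)]
    return pd.DataFrame((arr1 - arr2) / (arr1 + arr2),
                        columns=names, index=df.index)


df = pd.DataFrame({"A": [5, 20], "B": [10, 25], "C": [15, 30]})
result = pairwise_norm_diff(df)
print(result.round(2))
```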

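Another solution is very similar: build the combinations from column positions instead of labels, and create the new column names at the end by indexing:
import numpy as np
from itertools import combinations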

a, b = zip(*(combinations(np.arange(len(df.columns)), 2)))
arr = df.values
cols = df.columns.values
arr1 = arr[:, a]
arr2 = arr[:, b]
c = [f'{x}_{y}' for x, y in zip(cols[np.array(a)], cols[np.array(b)])]
df = pd.DataFrame((arr1-arr2)/(arr1+arr2), columns=c)
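As a variation that is not part of the original answer, the index pairs can also be produced by np.triu_indices, which returns the positions above the diagonal of an n-by-n matrix in the same order as itertools.combinations. A minimal sketch on the sample data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [5, 20], "B": [10, 25], "C": [15, 30]})

arr = df.to_numpy(dtype=float)
cols = df.columns.to_numpy()

# Index pairs (i, j) with i < j, in row-major order.
i, j = np.triu_indices(arr.shape[1], k=1)

names = [f"{x}_{y}" for x, y in zip(cols[i], cols[j])]
out = pd.DataFrame((arr[:, i] - arr[:, j]) / (arr[:, i] + arr[:, j]),
                   columns=names)
print(out)
```

This avoids building Python tuples of indices, although whether it is measurably faster at 381 columns was not tested here.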

Performance:
Tested on 5 rows and 381 columns:
np.random.seed(2019)
df = pd.DataFrame(np.random.randint(10,100,(5,381)))
df.columns = ['c'+str(i+1) for i in range(df.shape[1])]
#print (df)

In [4]: %%timeit
   ...: a, b = zip(*(combinations(np.arange(len(df.columns)), 2)))
   ...: arr = df.values
   ...: cols = df.columns.values
   ...: arr1 = arr[:, a]
   ...: arr2 = arr[:, b]
   ...: c = [f'{x}_{y}' for x, y in zip(cols[np.array(a)], cols[np.array(b)])]
   ...: pd.DataFrame((arr1-arr2)/(arr1+arr2), columns=c)
   ...:
62 ms ± 7.29 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [5]: %%timeit
   ...: a, b = zip(*(combinations(df.columns, 2)))
   ...: df1 = df.loc[:, a]
   ...: df2 = df.loc[:, b]
   ...: arr1 = df1.values
   ...: arr2 = df2.values
   ...: c = [f'{x}_{y}' for x, y in zip(a, b)]
   ...: pd.DataFrame((arr1-arr2)/(arr1+arr2), columns=c)
   ...:
63.2 ms ± 253 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [7]: %%timeit
   ...: func1(df)
   ...:
89.2 ms ± 331 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [8]: %%timeit
   ...: a, b = zip(*(combinations(df.columns, 2)))
   ...: df1 = df.loc[:, a]
   ...: df2 = df.loc[:, b]
   ...: c = [f'{x}_{y}' for x, y in zip(a, b)]
   ...: pd.DataFrame((df1.values-df2.values)/(df1.values+df2.values), columns=c)
   ...:
69.8 ms ± 6.04 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
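Alongside the timings, it may be worth confirming that the vectorized version returns the same numbers as a plain per-pair loop. The sketch below does this on a small random frame; the 8-column size and the variable names are chosen here for illustration, and the loop is written in the spirit of the question's code rather than copied from it.

```python
from itertools import combinations

import numpy as np
import pandas as pd

rng = np.random.default_rng(2019)
df = pd.DataFrame(rng.integers(10, 100, size=(5, 8)),
                  columns=[f"c{i + 1}" for i in range(8)])

# Reference: compute one column per pair in a Python loop.
loop_cols = {}
for x, y in combinations(df.columns, 2):
    loop_cols[f"{x}_{y}"] = (df[x] - df[y]) / (df[x] + df[y])
expected = pd.DataFrame(loop_cols)

# Vectorized: index the value array by column positions.
a, b = zip(*combinations(range(df.shape[1]), 2))
arr = df.to_numpy(dtype=float)
arr1, arr2 = arr[:, list(a)], arr[:, list(b)]
vectorized = pd.DataFrame((arr1 - arr2) / (arr1 + arr2),
                          columns=list(expected.columns))

# Raises AssertionError if the two results differ.
pd.testing.assert_frame_equal(expected, vectorized)
print(vectorized.shape)
```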

Source: "python - improving the performance of arithmetic operations over column combinations", from a similar question on Stack Overflow: https://stackoverflow.com/questions/55116552/
