Randomly concatenating DataFrames row by row

Problem description

How can I randomly merge, join or concat pandas data frames by row? Suppose I have four data frames like these (with many more rows):

import pandas as pd

df1 = pd.DataFrame({'col1':["1_1", "1_1"], 'col2':["1_2", "1_2"], 'col3':["1_3", "1_3"]})
df2 = pd.DataFrame({'col1':["2_1", "2_1"], 'col2':["2_2", "2_2"], 'col3':["2_3", "2_3"]})
df3 = pd.DataFrame({'col1':["3_1", "3_1"], 'col2':["3_2", "3_2"], 'col3':["3_3", "3_3"]})
df4 = pd.DataFrame({'col1':["4_1", "4_1"], 'col2':["4_2", "4_2"], 'col3':["4_3", "4_3"]})

How can I join these four data frames so that the output looks something like this, with the frames placed in a different random order in each row:

  col1 col2 col3 col1 col2 col3 col1 col2 col3 col1 col2 col3
0  1_1  1_2  1_3  4_1  4_2  4_3  2_1  2_2  2_3  3_1  3_2  3_3
1  2_1  2_2  2_3  1_1  1_2  1_3  3_1  3_2  3_3  4_1  4_2  4_3

I was thinking I could do something like this:

import random

my_list = [df1, df2, df3, df4]
# The list of frames is shuffled once here, before any row is processed
my_list = random.sample(my_list, len(my_list))
df = pd.DataFrame({'empty' : []})

for row in df:
    new_df = pd.concat(my_list, axis=1)

print(new_df)

The for statement above does not work beyond the first row: every later row (I have many more) comes out in the same order, i.e. the frames are shuffled only once:

  col1 col2 col3 col1 col2 col3 col1 col2 col3 col1 col2 col3
0  4_1  4_2  4_3  1_1  1_2  1_3  2_1  2_2  2_3  3_1  3_2  3_3
1  4_1  4_2  4_3  1_1  1_2  1_3  2_1  2_2  2_3  3_1  3_2  3_3

Recommended answer

UPDATE: a much better solution from @Divakar:

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'col1':["1_1", "1_1"], 'col2':["1_2", "1_2"], 'col3':["1_3", "1_3"], 'col4':["1_4", "1_4"]})
df2 = pd.DataFrame({'col1':["2_1", "2_1"], 'col2':["2_2", "2_2"], 'col3':["2_3", "2_3"], 'col4':["2_4", "2_4"]})
df3 = pd.DataFrame({'col1':["3_1", "3_1"], 'col2':["3_2", "3_2"], 'col3':["3_3", "3_3"], 'col4':["3_4", "3_4"]})
df4 = pd.DataFrame({'col1':["4_1", "4_1"], 'col2':["4_2", "4_2"], 'col3':["4_3", "4_3"], 'col4':["4_4", "4_4"]})

dfs = [df1, df2, df3, df4]
n = len(dfs)
nrows = dfs[0].shape[0]
ncols = dfs[0].shape[1]
A = pd.concat(dfs, axis=1).values.reshape(nrows,-1,ncols)
sidx = np.random.rand(nrows,n).argsort(1)
out_arr = A[np.arange(nrows)[:,None],sidx,:].reshape(nrows,-1)
df = pd.DataFrame(out_arr)

Output:

In [203]: df
Out[203]:
    0    1    2    3    4    5    6    7    8    9    10   11   12   13   14   15
0  3_1  3_2  3_3  3_4  1_1  1_2  1_3  1_4  4_1  4_2  4_3  4_4  2_1  2_2  2_3  2_4
1  4_1  4_2  4_3  4_4  2_1  2_2  2_3  2_4  3_1  3_2  3_3  3_4  1_1  1_2  1_3  1_4

Explanation: (c) Divakar

NumPy-based solution

Let's build a NumPy-based vectorized solution, and hopefully a fast one!

1) Reshape the array of concatenated values into a 3D array, "cutting" each row into groups of ncols values, one group per input dataframe:

A = pd.concat(dfs, axis=1).values.reshape(nrows,-1,ncols)
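
To make the reshape in step 1 concrete, here is a minimal sketch on two tiny frames. The frames df_a and df_b and their values are made up for illustration and are not part of the original answer; the sketch also assumes every frame has the same shape and the same index.

```python
import pandas as pd

# Two hypothetical frames with 2 rows and 3 columns each
df_a = pd.DataFrame({'col1': ['a1', 'a1'], 'col2': ['a2', 'a2'], 'col3': ['a3', 'a3']})
df_b = pd.DataFrame({'col1': ['b1', 'b1'], 'col2': ['b2', 'b2'], 'col3': ['b3', 'b3']})

frames = [df_a, df_b]
nrows, ncols = frames[0].shape

# Side-by-side concat gives a (nrows, n * ncols) array; the reshape then
# splits each row into one group of ncols values per input frame.
A = pd.concat(frames, axis=1).values.reshape(nrows, -1, ncols)

print(A.shape)        # (nrows, number of frames, ncols)
print(A[0].tolist())  # the groups that make up the first row
```

After the reshape, A[i, j] is the block of values that frame j contributes to row i, which is what the later steps reorder.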

2) Next, we trick argsort into giving us random unique indices from 0 to N-1 for every row, where N is the number of input dataframes. Argsorting a row of random numbers returns a random permutation of the positions in that row:

sidx = np.random.rand(nrows,n).argsort(1)
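
The following sketch checks the claim in step 2 that every row of sidx is a permutation of 0 to N-1. The sizes are arbitrary, and the seeded np.random.default_rng generator is my substitution for np.random.rand so the run is repeatable; it is not what the original answer uses.

```python
import numpy as np

nrows, n = 5, 4                  # hypothetical sizes: 5 rows, 4 input frames
rng = np.random.default_rng(0)   # seeded generator (assumption, for repeatability)

# Argsort each row of random numbers: the result is an ordering of 0..n-1
sidx = rng.random((nrows, n)).argsort(axis=1)

# Sorting each row of sidx must give back 0, 1, ..., n-1 if it is a permutation
is_perm = bool((np.sort(sidx, axis=1) == np.arange(n)).all())

print(sidx)
print(is_perm)
```

Because each row is argsorted separately, the frame order is drawn independently for every row, which is exactly what the single random.sample call in the question could not provide.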

3) The final trick is NumPy's fancy indexing together with some broadcasting: we index into A with sidx to get the output array:

out_arr = A[np.arange(nrows)[:,None],sidx,:].reshape(nrows,-1)
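
The indexing in step 3 is easier to follow with a fixed order instead of a random one. In this sketch the integer array A and the hand-written sidx are illustrative stand-ins, not data from the original answer.

```python
import numpy as np

# A[i, j] is the group of values contributed by frame j in row i
A = np.arange(2 * 3 * 2).reshape(2, 3, 2)   # nrows=2, n=3 frames, ncols=2
sidx = np.array([[2, 0, 1],
                 [1, 2, 0]])                # fixed frame order per row, for illustration

rows = np.arange(A.shape[0])[:, None]       # shape (2, 1), broadcasts against sidx's (2, 3)
picked = A[rows, sidx, :]                   # picked[i, k] == A[i, sidx[i, k]]
out_arr = picked.reshape(A.shape[0], -1)    # flatten the groups back into one row each

print(out_arr)
# [[ 4  5  0  1  2  3]
#  [ 8  9 10 11  6  7]]
```

The column vector of row numbers pairs row i with every entry of sidx[i], so each row is reordered by its own permutation without a Python loop.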

4) If needed, convert to a dataframe:

df = pd.DataFrame(out_arr)
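
The four steps can be wrapped in one function. This is only a sketch: the name shuffle_frames_per_row, the seed parameter and the use of np.random.default_rng are my own choices, and the function assumes all frames have the same shape and the same index.

```python
import numpy as np
import pandas as pd

def shuffle_frames_per_row(frames, seed=None):
    """Concatenate equally shaped frames side by side, using an
    independent random frame order for each row (hypothetical helper)."""
    n = len(frames)
    nrows, ncols = frames[0].shape
    rng = np.random.default_rng(seed)

    # Step 1: one group of ncols values per frame and row
    A = pd.concat(frames, axis=1).values.reshape(nrows, n, ncols)
    # Step 2: a random permutation of 0..n-1 for every row
    sidx = rng.random((nrows, n)).argsort(axis=1)
    # Step 3: reorder the groups of each row by that row's permutation
    out = A[np.arange(nrows)[:, None], sidx, :].reshape(nrows, -1)
    # Step 4: back to a DataFrame, keeping the index of the first frame
    return pd.DataFrame(out, index=frames[0].index)

# Four hypothetical frames with 3 rows and 2 columns each
frames = [pd.DataFrame({'col1': [f'{k}_1'] * 3, 'col2': [f'{k}_2'] * 3})
          for k in range(1, 5)]
result = shuffle_frames_per_row(frames, seed=0)
print(result)
```

The result has plain integer column labels, as in the output above; if the original column names matter, they would have to be rebuilt separately, since their order now differs from row to row.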

Old answer:

IIUC you can do it this way:

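import random

dfs = [df1, df2, df3, df4]
n = len(dfs)
ncols = dfs[0].shape[1]
v = pd.concat(dfs, axis=1).values
# a[k] holds the column positions in v that belong to frame k
a = np.arange(n * ncols).reshape(n, ncols)

# For each row, draw a fresh frame order and pick that row's columns accordingly
# (v[i, ...] assumes the default 0..nrows-1 index)
df = pd.DataFrame(np.asarray([v[i, a[random.sample(range(n), n)].reshape(n * ncols,)] for i in dfs[0].index]))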

Output:

In [150]: df
Out[150]:
    0    1    2    3    4    5    6    7    8    9    10   11
0  1_1  1_2  1_3  3_1  3_2  3_3  4_1  4_2  4_3  2_1  2_2  2_3
1  2_1  2_2  2_3  1_1  1_2  1_3  3_1  3_2  3_3  4_1  4_2  4_3

Explanation:

In [151]: v
Out[151]:
array([['1_1', '1_2', '1_3', '2_1', '2_2', '2_3', '3_1', '3_2', '3_3', '4_1', '4_2', '4_3'],
       ['1_1', '1_2', '1_3', '2_1', '2_2', '2_3', '3_1', '3_2', '3_3', '4_1', '4_2', '4_3']], dtype=object)

In [152]: a
Out[152]:
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])
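
To see how a is used in the old answer, the sketch below reorders its rows with a fixed frame order and flattens the result into column positions. The list order is a hand-picked stand-in for one draw of random.sample(range(n), n) and is not taken from the original answer.

```python
import numpy as np

n, ncols = 4, 3
a = np.arange(n * ncols).reshape(n, ncols)   # a[k] = column positions of frame k

order = [2, 0, 3, 1]                         # stand-in for random.sample(range(n), n)
cols = a[order].reshape(n * ncols)           # column positions in the shuffled frame order

print(cols.tolist())
# [6, 7, 8, 0, 1, 2, 9, 10, 11, 3, 4, 5]
```

Indexing a row of v with cols then yields that row with frame 3's columns first, followed by frames 1, 4 and 2. The old answer repeats this with a new random order for every row, which is why it needs a Python-level loop.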
