python - How to organize data from multiple datasets into the same DataFrame using Pandas in Python?

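I am having trouble getting my data into the DataFrame layout I want using Pandas in Python.

I would like a DataFrame in which the data is split into three columns (say Time, V and I).

However, I want data from different samples to live in the same DataFrame, so that I can easily select the data from Sample #1 or Sample #2.

What I came up with is something like this: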

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'Time': np.arange(0,10,0.5), 'V': np.random.rand(20), 'I': np.random.rand(20)})
df1['Sample']= 'sample_1'

df2 = pd.DataFrame({'Time': np.arange(0,10,0.5), 'V': np.random.rand(20), 'I': np.random.rand(20)})
df2['Sample']= 'sample_2'

# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
df = pd.concat([df1, df2])

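Note that I added another column named Sample to keep track of which data belongs to which sample.

But then I don't know how to retrieve the sample_1 or sample_2 data from df.

How can I do this, and is this the right way to organize the data? Should I use a MultiIndex?

Best answer

Yes, a MultiIndex is one possible solution: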

np.random.seed(1)
df1 = pd.DataFrame({'Time': np.arange(0,10,0.5),
                    'V': np.random.rand(20),
                    'I': np.random.rand(20)})

np.random.seed(2)
df2 = pd.DataFrame({'Time': np.arange(0,10,0.5),
                    'V': np.random.rand(20),
                    'I': np.random.rand(20)})

#print (df1)
#print (df2)

You can concat all the DataFrames into one and use the keys parameter to label each source DataFrame:

df = pd.concat([df1, df2], keys=('sample_1','sample_2'))
print (df)
                    I  Time         V
sample_1 0   0.800745   0.0  0.417022
         1   0.968262   0.5  0.720324
         2   0.313424   1.0  0.000114
         3   0.692323   1.5  0.302333
         4   0.876389   2.0  0.146756
         5   0.894607   2.5  0.092339
         6   0.085044   3.0  0.186260
         7   0.039055   3.5  0.345561
         8   0.169830   4.0  0.396767
         9   0.878143   4.5  0.538817
         10  0.098347   5.0  0.419195
         11  0.421108   5.5  0.685220
         12  0.957890   6.0  0.204452
         13  0.533165   6.5  0.878117
         14  0.691877   7.0  0.027388
         15  0.315516   7.5  0.670468
         16  0.686501   8.0  0.417305
         17  0.834626   8.5  0.558690
         18  0.018288   9.0  0.140387
         19  0.750144   9.5  0.198101
sample_2 0   0.505246   0.0  0.435995
         1   0.065287   0.5  0.025926
         2   0.428122   1.0  0.549662
         3   0.096531   1.5  0.435322
         4   0.127160   2.0  0.420368
         5   0.596745   2.5  0.330335
         6   0.226012   3.0  0.204649
         7   0.106946   3.5  0.619271
         8   0.220306   4.0  0.299655
         9   0.349826   4.5  0.266827
         10  0.467787   5.0  0.621134
         11  0.201743   5.5  0.529142
         12  0.640407   6.0  0.134580
         13  0.483070   6.5  0.513578
         14  0.505237   7.0  0.184440
         15  0.386893   7.5  0.785335
         16  0.793637   8.0  0.853975
         17  0.580004   8.5  0.494237
         18  0.162299   9.0  0.846561
         19  0.700752   9.5  0.079645
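As a small sketch of how the resulting index can be inspected, the example below rebuilds two frames of the same shape and passes the names parameter of pd.concat to label the index levels. The level names 'Sample' and 'Row' and the use of np.random.default_rng are assumptions made for illustration; they are not part of the answer above.

```python
import numpy as np
import pandas as pd

# Two example frames with the same columns as above (random values, illustration only).
rng = np.random.default_rng(0)
time = np.arange(0, 10, 0.5)
df1 = pd.DataFrame({'Time': time, 'V': rng.random(20), 'I': rng.random(20)})
df2 = pd.DataFrame({'Time': time, 'V': rng.random(20), 'I': rng.random(20)})

# keys labels each source frame; names gives the two index levels
# readable labels ('Sample' and 'Row' are hypothetical names).
df = pd.concat([df1, df2], keys=['sample_1', 'sample_2'], names=['Sample', 'Row'])

# Inspect the MultiIndex: number of levels, level names, and the distinct samples.
print(df.index.nlevels)
print(list(df.index.names))
print(df.index.get_level_values('Sample').unique().tolist())
```

Naming the levels is optional, but it lets later selections refer to level='Sample' instead of a positional level number.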

Data can be selected with xs - see cross section:

print (df.xs('sample_1', level=0))
           I  Time         V
0   0.800745   0.0  0.417022
1   0.968262   0.5  0.720324
2   0.313424   1.0  0.000114
3   0.692323   1.5  0.302333
4   0.876389   2.0  0.146756
5   0.894607   2.5  0.092339
6   0.085044   3.0  0.186260
7   0.039055   3.5  0.345561
8   0.169830   4.0  0.396767
9   0.878143   4.5  0.538817
10  0.098347   5.0  0.419195
11  0.421108   5.5  0.685220
12  0.957890   6.0  0.204452
13  0.533165   6.5  0.878117
14  0.691877   7.0  0.027388
15  0.315516   7.5  0.670468
16  0.686501   8.0  0.417305
17  0.834626   8.5  0.558690
18  0.018288   9.0  0.140387
19  0.750144   9.5  0.198101
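For selecting on the outermost level, plain .loc with the key should give the same rows as xs. The sketch below compares the two on example data; the random values and variable names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Example frames with the same layout as above (random values, illustration only).
rng = np.random.default_rng(0)
time = np.arange(0, 10, 0.5)
df1 = pd.DataFrame({'Time': time, 'V': rng.random(20), 'I': rng.random(20)})
df2 = pd.DataFrame({'Time': time, 'V': rng.random(20), 'I': rng.random(20)})
df = pd.concat([df1, df2], keys=['sample_1', 'sample_2'])

# Cross section on the first index level; the selected level is dropped.
by_xs = df.xs('sample_1', level=0)

# Label-based lookup on the first level, which also drops that level.
by_loc = df.loc['sample_1']

# Check whether both selections hold the same data.
print(by_xs.equals(by_loc))
```

xs becomes more useful when the key sits on an inner level, where a plain df.loc[key] lookup would not apply.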

If needed, select only some of the columns:

print (df.xs('sample_1', level=0)[['Time','I']])
    Time         I
0    0.0  0.800745
1    0.5  0.968262
2    1.0  0.313424
3    1.5  0.692323
4    2.0  0.876389
5    2.5  0.894607
6    3.0  0.085044
7    3.5  0.039055
8    4.0  0.169830
9    4.5  0.878143
10   5.0  0.098347
11   5.5  0.421108
12   6.0  0.957890
13   6.5  0.533165
14   7.0  0.691877
15   7.5  0.315516
16   8.0  0.686501
17   8.5  0.834626
18   9.0  0.018288
19   9.5  0.750144

Another solution is to use IndexSlice - see using slicers:

idx = pd.IndexSlice
print (df.loc[idx['sample_1',:], ['Time','I']])
             Time         I
sample_1 0    0.0  0.800745
         1    0.5  0.968262
         2    1.0  0.313424
         3    1.5  0.692323
         4    2.0  0.876389
         5    2.5  0.894607
         6    3.0  0.085044
         7    3.5  0.039055
         8    4.0  0.169830
         9    4.5  0.878143
         10   5.0  0.098347
         11   5.5  0.421108
         12   6.0  0.957890
         13   6.5  0.533165
         14   7.0  0.691877
         15   7.5  0.315516
         16   8.0  0.686501
         17   8.5  0.834626
         18   9.0  0.018288
         19   9.5  0.750144
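IndexSlice can also slice on the inner level across all samples at once. The sketch below, on example data, takes the rows labelled 0 to 4 from every sample; the call to sort_index is a precaution added here because slicing a MultiIndex generally expects a sorted index, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Example frames with the same layout as above (random values, illustration only).
rng = np.random.default_rng(0)
time = np.arange(0, 10, 0.5)
df1 = pd.DataFrame({'Time': time, 'V': rng.random(20), 'I': rng.random(20)})
df2 = pd.DataFrame({'Time': time, 'V': rng.random(20), 'I': rng.random(20)})

# Sort the index so that slicing on the MultiIndex is well defined.
df = pd.concat([df1, df2], keys=['sample_1', 'sample_2']).sort_index()

idx = pd.IndexSlice

# First level: every sample (:). Second level: row labels 0 through 4.
# Label slices in .loc include both endpoints.
first_rows = df.loc[idx[:, 0:4], ['Time', 'V']]
print(first_rows)
```

Because label slices are inclusive, this keeps five rows per sample rather than four.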

If needed, remove the first level of the MultiIndex:

idx = pd.IndexSlice
print (df.loc[idx['sample_1',:], ['Time','I']].reset_index(level=0, drop=True))
    Time         I
0    0.0  0.800745
1    0.5  0.968262
2    1.0  0.313424
3    1.5  0.692323
4    2.0  0.876389
5    2.5  0.894607
6    3.0  0.085044
7    3.5  0.039055
8    4.0  0.169830
9    4.5  0.878143
10   5.0  0.098347
11   5.5  0.421108
12   6.0  0.957890
13   6.5  0.533165
14   7.0  0.691877
15   7.5  0.315516
16   8.0  0.686501
17   8.5  0.834626
18   9.0  0.018288
19   9.5  0.750144
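As a possible alternative to reset_index(level=0, drop=True), DataFrame.droplevel removes an index level directly. The sketch below compares the two on example data; the random values and variable names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Example frames with the same layout as above (random values, illustration only).
rng = np.random.default_rng(0)
time = np.arange(0, 10, 0.5)
df1 = pd.DataFrame({'Time': time, 'V': rng.random(20), 'I': rng.random(20)})
df2 = pd.DataFrame({'Time': time, 'V': rng.random(20), 'I': rng.random(20)})
df = pd.concat([df1, df2], keys=['sample_1', 'sample_2'])

idx = pd.IndexSlice

# Selecting with IndexSlice keeps both index levels.
sub = df.loc[idx['sample_1', :], ['Time', 'I']]

# Two ways to drop the outer ('sample_1') level from the result.
dropped = sub.droplevel(0)
via_reset = sub.reset_index(level=0, drop=True)

print(dropped.equals(via_reset))
print(dropped.index.nlevels)
```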

Regarding python - How to organize data from multiple datasets into the same DataFrame using Pandas in Python?, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/39247527/