我有两个熊猫数据框,我想组合成一个。我希望将结果数据框沿非索引轴(在我的情况下为'seconds_since_start')排序。我想合并'seconds_since_start'具有相同值的行。我还想保留两个数据框之间的唯一列。

显示给定的输入和所需的输出可能会更容易。



df_a = """
valid_a,value_a,seconds_since_start
2000-02-15 14:47:00,12.3,0.0
2000-02-15 15:59:00,20.6,30.0
2000-02-15 16:51:00,20.3,120.0
2000-02-15 17:52:00,22.6,200.0
"""

df_b = """
valid_b,value_b,seconds_since_start
2019-12-24 14:54:00,12.4,20.0
2019-12-24 15:54:00,18.7,30.0
2019-12-24 16:54:00,19.2,90.0
2019-12-24 17:54:00,20.8,250.0
"""

df_desired_output = """
valid_a,valid_b,value_a,value_b,seconds_since_start
2000-02-15 14:47:00,,12.3,,0.0
,2019-12-24 14:54:00,,12.4,20.0
2000-02-15 15:59:00,2019-12-24 15:54:00,20.6,18.7,30.0
,2019-12-24 16:54:00,,19.2,90.0
2000-02-15 16:51:00,,20.3,,120.0
2000-02-15 17:52:00,,22.6,,200.0
,2019-12-24 17:54:00,,20.8,250.0
"""

from io import StringIO
import pandas as pd
import numpy as np

df_a = StringIO(df_a)
df_a = pd.read_csv(df_a)
df_a['valid_a'] = pd.to_datetime(df_a['valid_a'])  # convert 'valid' column to pd.datetime objects
df_a = df_a.set_index('valid_a')  # set the 'valid' as index

df_b = StringIO(df_b)
df_b = pd.read_csv(df_b)
df_b['valid_b'] = pd.to_datetime(df_b['valid_b'])  # convert 'valid' column to pd.datetime objects
df_b = df_b.set_index('valid_b')  # set the 'valid' as index

df_desired_output = StringIO(df_desired_output)
df_desired_output = pd.read_csv(df_desired_output)


print('input dataframe A\n', df_a)
print('input dataframe B\n', df_b)
print('desired output dataframe\n', df_desired_output)

df_new = pd.concat([df_a, df_b], sort=False)  # can't sort by 'seconds_since_start' from here so I do it on the next line
df_new = df_new.sort_values(by='seconds_since_start')  # sort
print('actual output\n', df_new)  # fails to merge rows that have the same value for 'seconds_since_start'




输出量

input dataframe A
                      value_a  seconds_since_start
valid_a
2000-02-15 14:47:00     12.3                  0.0
2000-02-15 15:59:00     20.6                 30.0
2000-02-15 16:51:00     20.3                120.0
2000-02-15 17:52:00     22.6                200.0
input dataframe B
                      value_b  seconds_since_start
valid_b
2019-12-24 14:54:00     12.4                 20.0
2019-12-24 15:54:00     18.7                 30.0
2019-12-24 16:54:00     19.2                 90.0
2019-12-24 17:54:00     20.8                250.0
desired output dataframe
                valid_a              valid_b  ...  value_b  seconds_since_start
0  2000-02-15 14:47:00                  NaN  ...      NaN                  0.0
1                  NaN  2019-12-24 14:54:00  ...     12.4                 20.0
2  2000-02-15 15:59:00  2019-12-24 15:54:00  ...     18.7                 30.0
3                  NaN  2019-12-24 16:54:00  ...     19.2                 90.0
4  2000-02-15 16:51:00                  NaN  ...      NaN                120.0
5  2000-02-15 17:52:00                  NaN  ...      NaN                200.0
6                  NaN  2019-12-24 17:54:00  ...     20.8                250.0

[7 rows x 5 columns]
actual output
                      value_a  seconds_since_start  value_b
2000-02-15 14:47:00     12.3                  0.0      NaN
2019-12-24 14:54:00      NaN                 20.0     12.4
2000-02-15 15:59:00     20.6                 30.0      NaN
2019-12-24 15:54:00      NaN                 30.0     18.7
2019-12-24 16:54:00      NaN                 90.0     19.2
2000-02-15 16:51:00     20.3                120.0      NaN
2000-02-15 17:52:00     22.6                200.0      NaN
2019-12-24 17:54:00      NaN                250.0     20.8

最佳答案

它只是合并:

pd.merge(df_a.reset_index(),
         df_b.reset_index(),
         on='seconds_since_start',
         how='outer')


输出:

    valid_a                value_a    seconds_since_start  valid_b                value_b
--  -------------------  ---------  ---------------------  -------------------  ---------
 0  2000-02-15 14:47:00       12.3                      0  NaT                      nan
 1  2000-02-15 15:59:00       20.6                     30  2019-12-24 15:54:00       18.7
 2  2000-02-15 16:51:00       20.3                    120  NaT                      nan
 3  2000-02-15 17:52:00       22.6                    200  NaT                      nan
 4  NaT                      nan                       20  2019-12-24 14:54:00       12.4
 5  NaT                      nan                       90  2019-12-24 16:54:00       19.2
 6  NaT                      nan                      250  2019-12-24 17:54:00       20.8

07-24 09:38
查看更多