How to join two tables in pandas based on a time delay


Problem description

I have two CSV files, df1 and df2.

When I use the command:

df1 = pd.read_csv("path", index_col="created_at", parse_dates=["created_at"])

I get:

                      index   likes    ...      user_screen_name  sentiment
created_at                            ...
2019-02-27 05:36:29      0   94574    ...       realDonaldTrump   positive
2019-02-27 05:31:21      1   61666    ...       realDonaldTrump   negative
2019-02-26 18:08:14      2  151844    ...       realDonaldTrump   positive
2019-02-26 04:50:37      3  184597    ...       realDonaldTrump   positive
2019-02-26 04:50:36      4  181641    ...       realDonaldTrump   negative
       ...             ...    ...     ...           ...             ...

When I use the command:

df2=pd.read_csv("path",index_col="created_at",parse_dates=["created_at"])

I get:

                     Unnamed: 0    Close     Open  Volume     Day
created_at
2019-03-01 00:47:00           0  2784.49  2784.49     NaN  STABLE
2019-03-01 00:21:00           1  2784.49  2784.49     NaN  STABLE
2019-03-01 00:20:00           2  2784.49  2784.49     NaN  STABLE
2019-03-01 00:19:00           3  2784.49  2784.49     NaN  STABLE
2019-03-01 00:18:00           4  2784.49  2784.49     NaN  STABLE
2019-03-01 00:17:00           5  2784.49  2784.49     NaN  STABLE
        ...                 ...    ...      ...       ...    ...

As you know, when you use the command:

df3=df1.join(df2)

This joins the two tables on the index "created_at", matching rows whose date and time are exactly equal in both tables.

But I would like to get the result with a delay of, for example, 2 minutes.

For example, instead of:

file df1                   file df2
created_at                 created_at
2019-02-27 05:36:29        2019-02-27 05:36:29

I would like the two tables to be joined like this:

file df1                   file df2
created_at                 created_at
2019-02-27 05:36:29        2019-02-27 05:38:29

It is important for my data that the df1 time comes before the df2 time; that is, the df1 event must precede the df2 event.

Recommended answer

For small dataframes, the answer to "Merging two dataframes based on a date between two other dates without a common column" contains a nice solution. It simply uses a Cartesian product of both dataframes, so it will not scale well to larger dataframes.
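To illustrate the Cartesian-product idea on a small scale, here is a minimal sketch. The sample rows and the names `pairs` and `cross_result` are invented for the example, and `how="cross"` assumes pandas 1.2 or later.

```python
import pandas as pd

# Hypothetical sample data standing in for the two CSV files.
df1 = pd.DataFrame(
    {"sentiment": ["positive", "negative"]},
    index=pd.to_datetime(["2019-02-27 05:36:29", "2019-02-27 05:31:21"]),
)
df1.index.name = "created_at"

df2 = pd.DataFrame(
    {"Close": [2784.49, 2785.10, 2786.00]},
    index=pd.to_datetime(
        ["2019-02-27 05:38:00", "2019-02-27 05:32:00", "2019-02-27 05:45:00"]
    ),
)
df2.index.name = "created_at"

# Cartesian product: every df1 row is paired with every df2 row.
left = df1.reset_index()
right = df2.reset_index().rename(columns={"created_at": "created_at_2"})
pairs = left.merge(right, how="cross")

# Keep only the pairs where the df2 row follows the df1 row by at most 2 minutes.
delay = pairs["created_at_2"] - pairs["created_at"]
cross_result = pairs[(delay >= pd.Timedelta(0)) & (delay <= pd.Timedelta(minutes=2))]
```

The intermediate `pairs` frame has len(df1) * len(df2) rows, which is why this only suits small inputs.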

A possible optimization is to add rounded datetime columns to the dataframes and join on those columns. As a join is much more efficient than a Cartesian product, the gain in memory and execution time should be noticeable.

What you want is (pseudo code here):

df1.created_at <= df2.created_at and df2.created_at - df1.created_at <= 2mins

I would add to both dataframes a ref column that floors created_at to a 2-minute slot, defined as (still pseudo code): created_at - (created_at.minute % 2), with the seconds dropped

If rows in both dataframes share the same ref value, their dates are less than 2 minutes apart. But this will not catch every expected case, because two dates can be less than 2 minutes apart and still fall in two different slots. To cope with that, I suggest adding a ref2 column to df1, defined as ref + 2 minutes, and doing a second join on df2.ref == df1.ref2. That is enough because you want the df1 event to come before the df2 one; otherwise we would need a third column ref3 = ref - 2 minutes.
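The two-slot situation can be checked on a pair of timestamps. This is only a sketch: the timestamps are made up, and computing the slot with `Timestamp.floor("2min")` is an assumption of the example, not something the answer prescribes.

```python
import pandas as pd

t1 = pd.Timestamp("2019-02-27 05:37:29")  # hypothetical df1 event
t2 = pd.Timestamp("2019-02-27 05:38:10")  # hypothetical df2 event, 41 seconds later

# Floor each timestamp to the start of its 2-minute slot.
slot1 = t1.floor("2min")
slot2 = t2.floor("2min")

# The events are close together but sit in different slots ...
same_slot = slot1 == slot2
# ... and the df2 event is found in the slot right after the df1 slot.
next_slot = slot1 + pd.Timedelta(minutes=2) == slot2
```

Here `same_slot` is False and `next_slot` is True, which is the case the second join on ref2 is meant to catch.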

Then, as in the referenced answer, we can select the rows that actually meet the requirement and concatenate the two joined dataframes.

The pandas code could be:

import pandas as pd

# create auxiliary columns: floor each timestamp to its 2-minute slot
# (flooring also drops the seconds, so rows in the same slot share a key)
df1['ref'] = df1.index.floor('2min')
df1['ref2'] = df1.ref + pd.Timedelta(minutes=2)

df2['ref'] = df2.index.floor('2min')
df2.index.name = 'created_at_2'
df2 = df2.reset_index().set_index('ref')

# join on ref and select the relevant lines
x1 = df1.join(df2, on='ref', how='inner')
x1 = x1.loc[(x1.index <= x1.created_at_2)
            & (x1.created_at_2 - x1.index <= pd.Timedelta(minutes=2))]

# join on ref2 and select the relevant lines
x2 = df1.join(df2, on='ref2', how='inner')
x2 = x2.loc[(x2.index <= x2.created_at_2)
            & (x2.created_at_2 - x2.index <= pd.Timedelta(minutes=2))]

# concatenate the partial result and clean the resulting dataframe
merged = pd.concat([x1, x2]).drop(columns=['ref', 'ref2'])
merged.index.name = 'created_at'
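As a self-contained check of the slot-based join, the sketch below wraps the same idea in a function and runs it on a few invented rows. The function name `join_with_delay`, the helper column `slot`, and the sample values are hypothetical. It also merges on explicit columns rather than on the index, which is a design choice of this sketch.

```python
import pandas as pd

def join_with_delay(df1, df2, minutes=2):
    """Pair each df1 row with the df2 rows that follow it by at most `minutes`."""
    width = pd.Timedelta(minutes=minutes)
    freq = f"{minutes}min"

    left = df1.reset_index()  # created_at becomes a regular column
    right = df2.reset_index().rename(columns={"created_at": "created_at_2"})

    # Slot keys: the slot of the df1 event and the slot that follows it.
    left["ref"] = left["created_at"].dt.floor(freq)
    left["ref2"] = left["ref"] + width
    right["slot"] = right["created_at_2"].dt.floor(freq)

    parts = []
    for key in ("ref", "ref2"):
        part = left.merge(right, left_on=key, right_on="slot")
        delay = part["created_at_2"] - part["created_at"]
        # Keep pairs where df2 comes after df1, within the allowed delay.
        parts.append(part[(delay >= pd.Timedelta(0)) & (delay <= width)])

    merged = pd.concat(parts).drop(columns=["ref", "ref2", "slot"])
    merged = merged.sort_values(["created_at", "created_at_2"])
    return merged.set_index("created_at")

# Hypothetical sample data standing in for the two CSV files.
df1 = pd.DataFrame(
    {"sentiment": ["positive", "negative"]},
    index=pd.to_datetime(["2019-02-27 05:36:29", "2019-02-27 05:37:50"]),
)
df1.index.name = "created_at"

df2 = pd.DataFrame(
    {"Close": [2784.0, 2785.0, 2786.0]},
    index=pd.to_datetime(
        ["2019-02-27 05:37:00", "2019-02-27 05:38:00", "2019-02-27 05:41:00"]
    ),
)
df2.index.name = "created_at"

result = join_with_delay(df1, df2)
```

With this data, the 05:36:29 event is paired with the 05:37:00 and 05:38:00 rows, and the 05:37:50 event only with 05:38:00. The 05:41:00 row is more than 2 minutes after both events and is dropped.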
