前两个表下面是cat1cat2
基本上,我希望将cat1cat2保留在一起,其中cat1日期在cat2的开始/结束日期之间。
最后一张表格是结果,我应该如何有效地实现这一目标?请注意,我在真实表中有10万条记录。
顺便说一句,如果我们的日期cat1落在两个StartDate和EndDate(两行cat2)中,那么它将最终在输出表中包含两行。

python - 日期的左连接与另一个表的开始和结束日期-LMLPHP

请在下面找到cat1cat2的代码

import pandas as pd
from datetime import date

cat1=pd.DataFrame({
    "CatID":[1, 2, 3],
    "Date":[date(2019,10,1), date(2019,11,1), date(2019,12,1)]
})

cat2=pd.DataFrame({
    "Interesting":[1, 2, 3,4],
    "StartDate":[date(2019,10,1), date(2019,11,1),date(2019,11,1), date(2018,10,6)],
    "EndDate":[date(2019,10,30), date(2019,11,30),date(2019,11,30), date(2018,10,6)]
})



我在这里尝试了解决方案
How to merge two dataframe with column in overlapping datetime range

# df2[["start"]] is a column vector of size m and df1.timestamp.values is row
## vector of size n then broad cast will result matrix of shape m,n which is
## result of comparing each pair of m and n
compare = (cat2[["StartDate"]].values<=cat1.Date.values) & (cat2[["EndDate"]].values>=cat1.Date.values)

## get cell numbers which is in range 0 to matrix size which meets the condition
ind = np.arange(len(cat1)*len(cat2))[compare.ravel()]


## calculate row and column index from cell number
pd.concat([cat2.iloc[ind//len(cat1)].reset_index(drop=True), cat1.iloc[ind%len(cat1)].reset_index(drop=True)], axis=1, sort=False)


我得到了

python - 日期的左连接与另一个表的开始和结束日期-LMLPHP

它不是我想要的,因为它缺少最后一行(因为我需要的是左连接)。

最佳答案

您产生的是“内部”联接。在连接右列之前,将它们构建为单独的数据框架。确保行数相等。

# df2[["start"]] is a column vector of size m and df1.timestamp.values is row
## vector of size n then broad cast will result matrix of shape m,n which is
## result of comparing each pair of m and n
compare = (cat2[["StartDate"]].values<=cat1.Date.values) & (cat2[["EndDate"]].values>=cat1.Date.values)


## get cell numbers which is in range 0 to matrix size which meets the condition
ind_old = np.arange(len(cat1)*len(cat2))[compare.ravel()]
ind_new = np.arange(len(cat1)*len(cat2))[compare.T.ravel()]
cat1_new = pd.DataFrame(np.nan, index=list(cat2.index), columns=cat1.columns)

## match the rows of cat1 to those of cat2
cat1_new.iloc[ind_new%len(cat2)] = cat1.iloc[ind_old%len(cat1)].reset_index(drop=True)


## calculate row and column index from cell number
print(pd.concat([cat2, cat1_new], axis=1, sort=False))

## concat
print(pd.concat([cat2, cat1_new], axis=1, sort=False))

#    Interesting   StartDate     EndDate  CatID        Date
# 0            1  2019-10-01  2019-10-30    1.0  2019-10-01
# 1            2  2019-11-01  2019-11-30    2.0  2019-11-01
# 2            3  2019-11-01  2019-11-30    2.0  2019-11-01
# 3            4  2018-10-06  2018-10-06    NaN         NaN

关于python - 日期的左连接与另一个表的开始和结束日期,我们在Stack Overflow上找到一个类似的问题:https://stackoverflow.com/questions/58955865/

10-11 07:39
查看更多