前两个表下面是cat1
和cat2
表
基本上,我希望将cat1
与cat2
保留在一起,其中cat1
日期在cat2
的开始/结束日期之间。
最后一张表格是结果,我应该如何有效地实现这一目标?请注意,我在真实表中有10万条记录。
顺便说一句,如果我们的日期cat1
落在两个StartDate和EndDate(两行cat2
)中,那么它将最终在输出表中包含两行。
请在下面找到cat1
和cat2
的代码
import pandas as pd
from datetime import date
cat1=pd.DataFrame({
"CatID":[1, 2, 3],
"Date":[date(2019,10,1), date(2019,11,1), date(2019,12,1)]
})
cat2=pd.DataFrame({
"Interesting":[1, 2, 3,4],
"StartDate":[date(2019,10,1), date(2019,11,1),date(2019,11,1), date(2018,10,6)],
"EndDate":[date(2019,10,30), date(2019,11,30),date(2019,11,30), date(2018,10,6)]
})
我在这里尝试了解决方案
How to merge two dataframe with column in overlapping datetime range
# df2[["start"]] is a column vector of size m and df1.timestamp.values is row
## vector of size n then broad cast will result matrix of shape m,n which is
## result of comparing each pair of m and n
compare = (cat2[["StartDate"]].values<=cat1.Date.values) & (cat2[["EndDate"]].values>=cat1.Date.values)
## get cell numbers which is in range 0 to matrix size which meets the condition
ind = np.arange(len(cat1)*len(cat2))[compare.ravel()]
## calculate row and column index from cell number
pd.concat([cat2.iloc[ind//len(cat1)].reset_index(drop=True), cat1.iloc[ind%len(cat1)].reset_index(drop=True)], axis=1, sort=False)
我得到了
它不是我想要的,因为它缺少最后一行(因为我需要的是左连接)。
最佳答案
您产生的是“内部”联接。在连接右列之前,将它们构建为单独的数据框架。确保行数相等。
# df2[["start"]] is a column vector of size m and df1.timestamp.values is row
## vector of size n then broad cast will result matrix of shape m,n which is
## result of comparing each pair of m and n
compare = (cat2[["StartDate"]].values<=cat1.Date.values) & (cat2[["EndDate"]].values>=cat1.Date.values)
## get cell numbers which is in range 0 to matrix size which meets the condition
ind_old = np.arange(len(cat1)*len(cat2))[compare.ravel()]
ind_new = np.arange(len(cat1)*len(cat2))[compare.T.ravel()]
cat1_new = pd.DataFrame(np.nan, index=list(cat2.index), columns=cat1.columns)
## match the rows of cat1 to those of cat2
cat1_new.iloc[ind_new%len(cat2)] = cat1.iloc[ind_old%len(cat1)].reset_index(drop=True)
## calculate row and column index from cell number
print(pd.concat([cat2, cat1_new], axis=1, sort=False))
## concat
print(pd.concat([cat2, cat1_new], axis=1, sort=False))
# Interesting StartDate EndDate CatID Date
# 0 1 2019-10-01 2019-10-30 1.0 2019-10-01
# 1 2 2019-11-01 2019-11-30 2.0 2019-11-01
# 2 3 2019-11-01 2019-11-30 2.0 2019-11-01
# 3 4 2018-10-06 2018-10-06 NaN NaN
关于python - 日期的左连接与另一个表的开始和结束日期,我们在Stack Overflow上找到一个类似的问题:https://stackoverflow.com/questions/58955865/