Problem description
I have the following dataframe with a tricky problem:
Disease State Month Value
Covid Texas 2020-03 2
Covid Texas 2020-04 3
Covid Texas 2020-05 4
Covid Texas 2020-08 3
Cancer Florida 2020-04 4
Covid Florida 2020-03 6
Covid Florida 2020-04 4
Flu Florida 2020-03 5
I have to make a list of values for 3 consecutive months and create a new dataframe. However, there are some conditions:
The list will be created for each disease, each month (from start to end: Feb 2020 to Apr 2021), and each state.
If any specific month is absent from the dataset, a row for that month is created and its value is 0.
Desired output:
Disease State Month ValueList
Covid Texas 2020-02 [0, 2, 3] (no dataset for Feb 20 but next two months are)
Covid Texas 2020-03 [2, 3, 4] (has values for 3 consecutive months)
Covid Texas 2020-04 [3, 4, 0] (doesn’t have value for 6th month)
Covid Texas 2020-05 [4, 0, 0] (has value for present month)
Covid Texas 2020-06 [0, 0, 3] (has value for 8th month)
Covid Texas 2020-07 [0, 3, 0] (has value for 8th month)
Covid Texas 2020-08 [3, 0, 0] (has value for present month)
Covid Texas 2020-09 [0, 0, 0] (no dataset for next 3 months)
Covid Texas 2020-10 [0, 0, 0] (no dataset for next 3 months)
Covid Texas 2020-11 [0, 0, 0] (no dataset for next 3 months)
Covid Texas 2020-12 [0, 0, 0] (no dataset for next 3 months)
Covid Texas 2021-01 [0, 0, 0] (no dataset for next 3 months)
Covid Texas 2021-02 [0, 0, 0] (no dataset for next 3 months)
Covid Texas 2021-03 [0, 0, 0] (no dataset for next 3 months)
Covid Texas 2021-04 [0, 0, 0] (no dataset for next 3 months)
I am trying to fill in the dates using this:
df3 = (df2.set_index('MonthEnd')
          .groupby(['Disease', 'State'])
          .apply(lambda x: x.drop(['Disease', 'State'], axis=1).asfreq('D'))
          .reset_index())
However, it doesn't return the same time frame for each group. It only returns the values between the min and max date in that group.
I'm not sure how I should start. Any help would be appreciated. Thanks!
Recommended answer
Let's start with the simple logic: you want to create a date range from Feb 2020 to Apr 2021 for each group.
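As a minimal sketch of that idea (the names `months`, `s`, and `full` are illustrative and not part of the original answer), a month-start date range can be built once and then used to reindex any group, filling the missing months with 0:

```python
import pandas as pd

# Month-start ("MS") timestamps covering Feb 2020 through Apr 2021 inclusive.
months = pd.date_range(start="2020-02", end="2021-04", freq="MS")

# A toy series with values for only two months (hypothetical data).
s = pd.Series([2, 3], index=pd.to_datetime(["2020-03", "2020-04"], format="%Y-%m"))

# Reindexing onto the full range creates the missing months with value 0.
full = s.reindex(months, fill_value=0)

print(len(months))        # number of months in the range
print(full.head(4))
```

Because every group is reindexed onto the same `months` index, each group ends up with the same time frame regardless of its own min and max dates.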
Take each group and add this date range using reindex. Once the date range is added, fill in the missing data, then apply a rolling window to get 3 consecutive values (the two previous months and the current one) and convert each window into a list.
These lists are assigned to the ValueList column, and all the modified groups are then combined into one dataframe.
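To illustrate the rolling step on its own, here is a small sketch with made-up values (the names `values`, `windows`, and `padded` are illustrative). It assumes a pandas version in which a Rolling object can be iterated (1.1 or later):

```python
import pandas as pd

values = pd.Series([0, 2, 3, 4])

# Iterating a Rolling object yields one window (a Series) per row.
# The first windows are shorter than 3 because there are no earlier rows.
windows = [w.tolist() for w in values.rolling(3)]
print(windows)  # [[0], [0, 2], [0, 2, 3], [2, 3, 4]]

# Left-pad the short windows with zeros so every list has length 3.
padded = [[0] * (3 - len(w)) + w for w in windows]
print(padded)   # [[0, 0, 0], [0, 0, 2], [0, 2, 3], [2, 3, 4]]
```

The padding is what the helper function in the solution does for the first two rows of each group.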
Solution:
import pandas as pd

df.Month = pd.to_datetime(df.Month, format="%Y-%m")
df.set_index('Month', inplace=True)

def add_elem(li):
    # Left-pad with zeros when the rolling window has fewer than 3 rows.
    return [0] * (3 - len(li)) + li

start = '2020-02'
end = '2021-04'
groups = []
for i, grp in df.groupby(['Disease', 'State']):
    # Give every group the same monthly index from start to end.
    grp = grp.reindex(pd.date_range(start=start, end=end, freq="MS"))
    grp[['Disease', 'State']] = grp[['Disease', 'State']].bfill().ffill()
    grp = grp.fillna(0)
    grp['Value'] = grp['Value'].astype(int)
    grp['ValueList'] = [add_elem(window.to_list()) for window in grp['Value'].rolling(3)]
    groups.append(grp)
# DataFrame.append was removed in pandas 2.0; concatenate the groups instead.
data = pd.concat(groups)
Or, using apply:
def fill_date(grp):
    grp = grp.reindex(pd.date_range(start=start, end=end, freq="MS"))
    grp[['Disease', 'State']] = grp[['Disease', 'State']].bfill().ffill()
    grp = grp.fillna(0)
    grp['Value'] = grp['Value'].astype(int)
    grp['ValueList'] = [add_elem(window.to_list()) for window in grp['Value'].rolling(3)]
    return grp

data = df.groupby(['Disease', 'State'], as_index=False).apply(fill_date)
data:
 | Disease | State | Value | ValueList |
---|---|---|---|---|
2020-02-01 | Cancer | Florida | 0 | [0, 0, 0] |
2020-03-01 | Cancer | Florida | 0 | [0, 0, 0] |
2020-04-01 | Cancer | Florida | 4 | [0, 0, 4] |
2020-05-01 | Cancer | Florida | 0 | [0, 4, 0] |
2020-06-01 | Cancer | Florida | 0 | [4, 0, 0] |
2020-07-01 | Cancer | Florida | 0 | [0, 0, 0] |
2020-08-01 | Cancer | Florida | 0 | [0, 0, 0] |
2020-09-01 | Cancer | Florida | 0 | [0, 0, 0] |
2020-10-01 | Cancer | Florida | 0 | [0, 0, 0] |
2020-11-01 | Cancer | Florida | 0 | [0, 0, 0] |
2020-12-01 | Cancer | Florida | 0 | [0, 0, 0] |
2021-01-01 | Cancer | Florida | 0 | [0, 0, 0] |
2021-02-01 | Cancer | Florida | 0 | [0, 0, 0] |
2021-03-01 | Cancer | Florida | 0 | [0, 0, 0] |
2021-04-01 | Cancer | Florida | 0 | [0, 0, 0] |
2020-02-01 | Covid | Florida | 0 | [0, 0, 0] |
2020-03-01 | Covid | Florida | 6 | [0, 0, 6] |
2020-04-01 | Covid | Florida | 4 | [0, 6, 4] |
2020-05-01 | Covid | Florida | 0 | [6, 4, 0] |
2020-06-01 | Covid | Florida | 0 | [4, 0, 0] |
2020-07-01 | Covid | Florida | 0 | [0, 0, 0] |
2020-08-01 | Covid | Florida | 0 | [0, 0, 0] |
2020-09-01 | Covid | Florida | 0 | [0, 0, 0] |
2020-10-01 | Covid | Florida | 0 | [0, 0, 0] |
2020-11-01 | Covid | Florida | 0 | [0, 0, 0] |
2020-12-01 | Covid | Florida | 0 | [0, 0, 0] |
2021-01-01 | Covid | Florida | 0 | [0, 0, 0] |
2021-02-01 | Covid | Florida | 0 | [0, 0, 0] |
2021-03-01 | Covid | Florida | 0 | [0, 0, 0] |
2021-04-01 | Covid | Florida | 0 | [0, 0, 0] |
2020-02-01 | Covid | Texas | 0 | [0, 0, 0] |
2020-03-01 | Covid | Texas | 2 | [0, 0, 2] |
2020-04-01 | Covid | Texas | 3 | [0, 2, 3] |
2020-05-01 | Covid | Texas | 4 | [2, 3, 4] |
2020-06-01 | Covid | Texas | 0 | [3, 4, 0] |
2020-07-01 | Covid | Texas | 0 | [4, 0, 0] |
2020-08-01 | Covid | Texas | 3 | [0, 0, 3] |
2020-09-01 | Covid | Texas | 0 | [0, 3, 0] |
2020-10-01 | Covid | Texas | 0 | [3, 0, 0] |
2020-11-01 | Covid | Texas | 0 | [0, 0, 0] |
2020-12-01 | Covid | Texas | 0 | [0, 0, 0] |
2021-01-01 | Covid | Texas | 0 | [0, 0, 0] |
2021-02-01 | Covid | Texas | 0 | [0, 0, 0] |
2021-03-01 | Covid | Texas | 0 | [0, 0, 0] |
2021-04-01 | Covid | Texas | 0 | [0, 0, 0] |
2020-02-01 | Flu | Florida | 0 | [0, 0, 0] |
2020-03-01 | Flu | Florida | 5 | [0, 0, 5] |
2020-04-01 | Flu | Florida | 0 | [0, 5, 0] |
2020-05-01 | Flu | Florida | 0 | [5, 0, 0] |
2020-06-01 | Flu | Florida | 0 | [0, 0, 0] |
2020-07-01 | Flu | Florida | 0 | [0, 0, 0] |
2020-08-01 | Flu | Florida | 0 | [0, 0, 0] |
2020-09-01 | Flu | Florida | 0 | [0, 0, 0] |
2020-10-01 | Flu | Florida | 0 | [0, 0, 0] |
2020-11-01 | Flu | Florida | 0 | [0, 0, 0] |
2020-12-01 | Flu | Florida | 0 | [0, 0, 0] |
2021-01-01 | Flu | Florida | 0 | [0, 0, 0] |
2021-02-01 | Flu | Florida | 0 | [0, 0, 0] |
2021-03-01 | Flu | Florida | 0 | [0, 0, 0] |
2021-04-01 | Flu | Florida | 0 | [0, 0, 0] |
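Note that the lists in this output look backward (the two previous months plus the current one), whereas the desired output in the question lists the current month followed by the next two. If the forward-looking form is what is needed, one possible variant is sketched below. It is not part of the original answer. It rebuilds the sample data from the question and uses `shift` with negative periods, and the names `months`, `frames`, and `forward` are illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Disease": ["Covid"] * 4 + ["Cancer", "Covid", "Covid", "Flu"],
    "State": ["Texas"] * 4 + ["Florida"] * 4,
    "Month": ["2020-03", "2020-04", "2020-05", "2020-08",
              "2020-04", "2020-03", "2020-04", "2020-03"],
    "Value": [2, 3, 4, 3, 4, 6, 4, 5],
})
df["Month"] = pd.to_datetime(df["Month"], format="%Y-%m")
months = pd.date_range("2020-02", "2021-04", freq="MS")

frames = []
for (disease, state), grp in df.groupby(["Disease", "State"]):
    # Same monthly index for every group; absent months get value 0.
    v = grp.set_index("Month")["Value"].reindex(months, fill_value=0)
    # Window = current month plus the next two; months past the end count as 0.
    nxt1 = v.shift(-1, fill_value=0)
    nxt2 = v.shift(-2, fill_value=0)
    lists = [list(t) for t in zip(v.tolist(), nxt1.tolist(), nxt2.tolist())]
    frames.append(pd.DataFrame({
        "Disease": disease,
        "State": state,
        "Month": months.strftime("%Y-%m"),
        "ValueList": lists,
    }))

forward = pd.concat(frames, ignore_index=True)
print(forward[(forward.Disease == "Covid") & (forward.State == "Texas")].head(8))
```

For the Covid/Texas group, the first rows of `forward` should line up with the desired output in the question, starting with 2020-02 and [0, 2, 3].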