This article looks at how to create a new dataframe using multiple conditions across different timelines and locations.

Problem description

I have the following dataframe with a tricky problem:

Disease  State       Month      Value
Covid    Texas     2020-03        2
Covid    Texas     2020-04        3
Covid    Texas     2020-05        4
Covid    Texas     2020-08        3
Cancer   Florida   2020-04        4
Covid    Florida   2020-03        6
Covid    Florida   2020-04        4
Flu      Florida   2020-03        5

I have to build a list of values for 3 consecutive months and create a new dataframe. However, there are some conditions:

  1. The list will be created for each disease, each month (from start to end: Feb 2020 to Apr 2021), and each state.

  2. If any specific month is absent from the dataset, a row for that month will be created with a value of 0.

Desired output:

Disease State    Month      ValueList
Covid   Texas    2020-02    [0, 2, 3] (no dataset for Feb 20 but next two months are)
Covid   Texas    2020-03    [2, 3, 4] (has values for 3 consecutive months)
Covid   Texas    2020-04    [3, 4, 0] (doesn’t have value for 6th month)
Covid   Texas    2020-05    [4, 0, 0] (has value for present month)
Covid   Texas    2020-06    [0, 0, 3] (has value for 8th month)
Covid   Texas    2020-07    [0, 3, 0] (has value for 8th month)
Covid   Texas    2020-08    [3, 0, 0] (has value for present month)
Covid   Texas    2020-09    [0, 0, 0] (no dataset for next 3 months)
Covid   Texas    2020-10    [0, 0, 0] (no dataset for next 3 months)
Covid   Texas    2020-11    [0, 0, 0] (no dataset for next 3 months)
Covid   Texas    2020-12    [0, 0, 0] (no dataset for next 3 months)
Covid   Texas    2021-01    [0, 0, 0] (no dataset for next 3 months)
Covid   Texas    2021-02    [0, 0, 0] (no dataset for next 3 months)
Covid   Texas    2021-03    [0, 0, 0] (no dataset for next 3 months)
Covid   Texas    2021-04    [0, 0, 0] (no dataset for next 3 months)

I am trying to fill in the dates using this:

df3= (df2.set_index('MonthEnd')
   .groupby(['Disease', 'State']).apply(lambda x: x.drop(['Disease', 'State'], axis=1).asfreq('D'))
   .reset_index())

However, it doesn't return the same time frame for each group. It returns the values between the min and max date in that group.

I'm not sure how I should start. Any help would be appreciated. Thanks!

Recommended answer

Let's start with the basic logic: you want a date range from Feb 2020 to Apr 2021 for each group.
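As a minimal sketch of that idea, a fixed monthly range can be built with pd.date_range and applied to one group with reindex. The small series s below is a hypothetical stand-in for a single group's values (Covid in Texas, taken from the question's data); it is not part of the original answer.

```python
import pandas as pd

# Hypothetical stand-in for one group's values (Covid / Texas in the question).
s = pd.Series(
    [2, 3, 4, 3],
    index=pd.to_datetime(["2020-03", "2020-04", "2020-05", "2020-08"], format="%Y-%m"),
)

# Month-start timestamps covering Feb 2020 through Apr 2021.
full_range = pd.date_range(start="2020-02", end="2021-04", freq="MS")

# Reindex onto the full range; months missing from the data become 0.
filled = s.reindex(full_range, fill_value=0)

print(len(filled))
print(filled[pd.Timestamp("2020-06-01")])
```

Because every group is reindexed against the same full_range, each group ends up with the same time frame, regardless of its own min and max dates.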

Take each group and add this date range using reindex. Once the date range is in place, fill in the missing data, then apply a rolling window to get 3 consecutive values (the two previous values and the current one) and convert each window into a list.
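The rolling step can be illustrated on its own. The sketch below assumes a short, already zero-filled series of monthly values; the names values and pad_left are hypothetical and only serve the illustration.

```python
import pandas as pd

# Hypothetical monthly values for one group, already zero-filled.
values = pd.Series([0, 2, 3, 4, 0])

def pad_left(window, size=3):
    # Hypothetical helper: left-pad short windows with zeros.
    return [0] * (size - len(window)) + window

# Iterating over a Rolling object yields one sub-Series per row; the first
# windows are shorter than 3 because there are not yet enough prior rows.
value_lists = [pad_left(w.to_list()) for w in values.rolling(3)]
print(value_lists)
```

Each list ends with the current row's value, so the windows look backward rather than forward.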

These lists are assigned to the ValueList column, and the modified groups are then combined into a single dataframe.

Solution:

import pandas as pd

# Parse 'Month' as a timestamp (first day of each month) and use it as the index.
df['Month'] = pd.to_datetime(df['Month'], format="%Y-%m")
df = df.set_index('Month')

def add_elem(li):
    # Left-pad with zeros when the rolling window holds fewer than 3 values.
    return [0] * (3 - len(li)) + li

start = '2020-02'
end = '2021-04'

frames = []
for _, grp in df.groupby(['Disease', 'State']):
    # Expand the group to every month start in the range.
    grp = grp.reindex(pd.date_range(start=start, end=end, freq="MS"))
    grp[['Disease', 'State']] = grp[['Disease', 'State']].bfill().ffill()
    grp = grp.fillna(0)
    grp['Value'] = grp['Value'].astype(int)
    # Each window is the current month plus up to two preceding months.
    grp['ValueList'] = [add_elem(window.to_list()) for window in grp['Value'].rolling(3)]
    frames.append(grp)

# DataFrame.append was removed in pandas 2.0; concatenate the groups instead.
data = pd.concat(frames)


Using apply:

def fill_date(grp):
    # Reuses start, end and add_elem defined above.
    grp = grp.reindex(pd.date_range(start=start, end=end, freq="MS"))
    grp[['Disease', 'State']] = grp[['Disease', 'State']].bfill().ffill()
    grp = grp.fillna(0)
    grp['Value'] = grp['Value'].astype(int)
    grp['ValueList'] = [add_elem(window.to_list()) for window in grp['Value'].rolling(3)]
    return grp

# Relies on the grouping columns being passed to fill_date; recent pandas
# versions deprecate this behavior of groupby.apply.
data = df.groupby(['Disease', 'State'], as_index=False).apply(fill_date)

Contents of data:

            Disease  State    Value  ValueList
2020-02-01  Cancer   Florida  0      [0, 0, 0]
2020-03-01  Cancer   Florida  0      [0, 0, 0]
2020-04-01  Cancer   Florida  4      [0, 0, 4]
2020-05-01  Cancer   Florida  0      [0, 4, 0]
2020-06-01  Cancer   Florida  0      [4, 0, 0]
2020-07-01  Cancer   Florida  0      [0, 0, 0]
2020-08-01  Cancer   Florida  0      [0, 0, 0]
2020-09-01  Cancer   Florida  0      [0, 0, 0]
2020-10-01  Cancer   Florida  0      [0, 0, 0]
2020-11-01  Cancer   Florida  0      [0, 0, 0]
2020-12-01  Cancer   Florida  0      [0, 0, 0]
2021-01-01  Cancer   Florida  0      [0, 0, 0]
2021-02-01  Cancer   Florida  0      [0, 0, 0]
2021-03-01  Cancer   Florida  0      [0, 0, 0]
2021-04-01  Cancer   Florida  0      [0, 0, 0]
2020-02-01  Covid    Florida  0      [0, 0, 0]
2020-03-01  Covid    Florida  6      [0, 0, 6]
2020-04-01  Covid    Florida  4      [0, 6, 4]
2020-05-01  Covid    Florida  0      [6, 4, 0]
2020-06-01  Covid    Florida  0      [4, 0, 0]
2020-07-01  Covid    Florida  0      [0, 0, 0]
2020-08-01  Covid    Florida  0      [0, 0, 0]
2020-09-01  Covid    Florida  0      [0, 0, 0]
2020-10-01  Covid    Florida  0      [0, 0, 0]
2020-11-01  Covid    Florida  0      [0, 0, 0]
2020-12-01  Covid    Florida  0      [0, 0, 0]
2021-01-01  Covid    Florida  0      [0, 0, 0]
2021-02-01  Covid    Florida  0      [0, 0, 0]
2021-03-01  Covid    Florida  0      [0, 0, 0]
2021-04-01  Covid    Florida  0      [0, 0, 0]
2020-02-01  Covid    Texas    0      [0, 0, 0]
2020-03-01  Covid    Texas    2      [0, 0, 2]
2020-04-01  Covid    Texas    3      [0, 2, 3]
2020-05-01  Covid    Texas    4      [2, 3, 4]
2020-06-01  Covid    Texas    0      [3, 4, 0]
2020-07-01  Covid    Texas    0      [4, 0, 0]
2020-08-01  Covid    Texas    3      [0, 0, 3]
2020-09-01  Covid    Texas    0      [0, 3, 0]
2020-10-01  Covid    Texas    0      [3, 0, 0]
2020-11-01  Covid    Texas    0      [0, 0, 0]
2020-12-01  Covid    Texas    0      [0, 0, 0]
2021-01-01  Covid    Texas    0      [0, 0, 0]
2021-02-01  Covid    Texas    0      [0, 0, 0]
2021-03-01  Covid    Texas    0      [0, 0, 0]
2021-04-01  Covid    Texas    0      [0, 0, 0]
2020-02-01  Flu      Florida  0      [0, 0, 0]
2020-03-01  Flu      Florida  5      [0, 0, 5]
2020-04-01  Flu      Florida  0      [0, 5, 0]
2020-05-01  Flu      Florida  0      [5, 0, 0]
2020-06-01  Flu      Florida  0      [0, 0, 0]
2020-07-01  Flu      Florida  0      [0, 0, 0]
2020-08-01  Flu      Florida  0      [0, 0, 0]
2020-09-01  Flu      Florida  0      [0, 0, 0]
2020-10-01  Flu      Florida  0      [0, 0, 0]
2020-11-01  Flu      Florida  0      [0, 0, 0]
2020-12-01  Flu      Florida  0      [0, 0, 0]
2021-01-01  Flu      Florida  0      [0, 0, 0]
2021-02-01  Flu      Florida  0      [0, 0, 0]
2021-03-01  Flu      Florida  0      [0, 0, 0]
2021-04-01  Flu      Florida  0      [0, 0, 0]
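
Note that the lists shown here look backward (up to two previous months plus the current one), whereas the desired output in the question looks forward (the current month plus the next two). One possible way to get forward-looking lists, sketched here as an assumption rather than as part of the original answer, is to combine shift(-1) and shift(-2) on a zero-filled group. The series below is a hypothetical stand-in for the Covid/Texas group.

```python
import pandas as pd

idx = pd.date_range(start="2020-02", end="2021-04", freq="MS")
observed = pd.Series(
    [2, 3, 4, 3],
    index=pd.to_datetime(["2020-03", "2020-04", "2020-05", "2020-08"], format="%Y-%m"),
)
value = observed.reindex(idx, fill_value=0)

# shift(-k) moves the value from k months ahead onto the current row;
# months beyond the end of the range are treated as 0.
nxt1 = value.shift(-1, fill_value=0)
nxt2 = value.shift(-2, fill_value=0)

# Build [current, next, next-next] for every month in the range.
forward = pd.Series(
    [list(t) for t in zip(value.to_list(), nxt1.to_list(), nxt2.to_list())],
    index=idx,
    name="ValueList",
)
print(forward.head(7).to_list())
```

For this group the first rows come out as [0, 2, 3], [2, 3, 4], [3, 4, 0], [4, 0, 0], which lines up with the Covid/Texas rows of the desired output in the question.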
