Problem description
I am working with the NYT COVID-19 dataset, which records the cumulative number of cases for each county, per day.
I would like to compute the difference in cases from one day to the next, so that I get the number of new cases per day instead of the running total. Taking a rolling mean, or resampling every 2 days with a mean, sum, etc., all works just fine. It's just the subtraction that is giving me a headache.
Tried methods:
df.resample('2d').diff()
df.resample('1d').agg(np.subtract)
df.rolling(2).diff()
df.rolling('2').agg(np.subtract)
Sample data:
import datetime as dt
import numpy as np
import pandas as pd

pd.DataFrame(data={'state': ['Alabama', 'Alabama', 'Alabama', 'Alabama', 'Alabama'],
                   'date': [dt.date(2020, 3, 13), dt.date(2020, 3, 14), dt.date(2020, 3, 15), dt.date(2020, 3, 16), dt.date(2020, 3, 17)],
                   'covid_cases': [1.2, 2.0, 2.9, 3.6, 3.9]
                   })
Desired sample output:
pd.DataFrame(data={'state':['Alabama','Alabama','Alabama','Alabama','Alabama'],
'date':[dt.date(2020,3,13),dt.date(2020,3,14),dt.date(2020,3,15),dt.date(2020,3,16),dt.date(2020,3,17)],
'new_covid_cases':[np.nan,0.8,0.9,0.7,0.3]
})
Recreate sample data from original NYT dataset:
df = pd.read_csv('https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv',parse_dates=['date'])
df.groupby(['state','date'])[['cases']].mean().reset_index()
Any help would be greatly appreciated! I would like to learn how to do this manually or with a function rather than finding a "new cases" dataset, as I will be working with time series a lot in the very near future.
The diff function is the right tool, but look at the error message from your first attempt:
'DatetimeIndexResampler' object has no attribute 'diff'
diff is a method of DataFrames, not of Resamplers. A Resampler only defines the time bins, so turn it back into a DataFrame first by specifying how each bin should be aggregated (last, mean, sum, etc.), and then call diff on the result.
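To make the Resampler/DataFrame distinction concrete, here is a minimal sketch using the sample values from the question. It assumes the frame is indexed by date (resample needs a datetime-like index); the variable names are only illustrative.

```python
import pandas as pd

# Cumulative totals from the question's sample, on a daily DatetimeIndex (assumed layout)
idx = pd.date_range("2020-03-13", periods=5, freq="D")
df = pd.DataFrame({"covid_cases": [1.2, 2.0, 2.9, 3.6, 3.9]}, index=idx)

resampler = df.resample("1D")   # a Resampler: bins are defined, nothing is aggregated yet
daily = resampler.last()        # aggregating each bin returns an ordinary DataFrame
new_cases = daily.diff()        # DataFrame.diff subtracts the previous row from each row

# First row is NaN (nothing to subtract from), then roughly 0.8, 0.9, 0.7, 0.3
print(new_cases)
```

Since the sample is already daily, the resample step is not strictly needed here and df.diff() alone would give the same differences; it is kept only to show where the aggregation fits.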
If you have the total number of COVID cases for each day and want to resample to 2-day bins, you probably only want to keep the latest value within each bin, in which case something like
df.resample('2d').last().diff()
should work (assuming df is indexed by date).
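As a rough sketch of that suggestion, the first part below applies the 2-day version to the question's sample values. The second part is an extra assumption beyond the answer: when the data is in long format with several states, as in the desired output, a per-state groupby keeps one state's totals from being subtracted from another's. The Alaska rows are made-up numbers for illustration only.

```python
import pandas as pd

# Part 1: 2-day bins on a single series of cumulative totals
idx = pd.date_range("2020-03-13", periods=5, freq="D")
totals = pd.DataFrame({"covid_cases": [1.2, 2.0, 2.9, 3.6, 3.9]}, index=idx)

# Keep the latest total in each 2-day bin, then difference consecutive bins
two_day = totals.resample("2D").last().diff()
print(two_day)  # three rows: NaN, then roughly 1.6 and 0.3

# Part 2: long format with several states (Alaska values are hypothetical)
long_df = pd.DataFrame({
    "state": ["Alabama"] * 3 + ["Alaska"] * 3,
    "date": list(pd.date_range("2020-03-13", periods=3, freq="D")) * 2,
    "covid_cases": [1.2, 2.0, 2.9, 10.0, 12.0, 15.0],
})

# Sort so that rows within each state are in date order, then diff per state
long_df = long_df.sort_values(["state", "date"])
long_df["new_covid_cases"] = long_df.groupby("state")["covid_cases"].diff()
print(long_df)
```

The first row of each state comes out as NaN, which matches the np.nan in the desired output.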