Is there a function to get the difference between two values on a pandas DataFrame timeseries?

Problem description

I am messing around in the NYT covid dataset which has total covid cases for each county, per day.

I would like to find out the difference of cases between each day, so theoretically I could get the number of new cases per day instead of total cases. Taking a rolling mean, or resampling every 2 days using a mean/sum/etc all work just fine. It's just subtracting that is giving me such a headache.

Tried methods:

  • df.resample('2d').diff()
  • df.resample('1d').agg(np.subtract)
  • df.rolling(2).diff()
  • df.rolling('2').agg(np.subtract)

Sample data:

import datetime as dt
import numpy as np  # np.nan appears in the desired output
import pandas as pd

pd.DataFrame(data={'state':['Alabama','Alabama','Alabama','Alabama','Alabama'],
                   'date':[dt.date(2020,3,13),dt.date(2020,3,14),dt.date(2020,3,15),dt.date(2020,3,16),dt.date(2020,3,17)],
                   'covid_cases':[1.2,2.0,2.9,3.6,3.9]
                  })

Desired sample output:

pd.DataFrame(data={'state':['Alabama','Alabama','Alabama','Alabama','Alabama'],
               'date':[dt.date(2020,3,13),dt.date(2020,3,14),dt.date(2020,3,15),dt.date(2020,3,16),dt.date(2020,3,17)],
               'new_covid_cases':[np.nan,0.8,0.9,0.7,0.3]
              })

Recreate sample data from original NYT dataset:

df = pd.read_csv('https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv',parse_dates=['date'])
df.groupby(['state','date'])[['cases']].mean().reset_index()

Any help would be greatly appreciated! Would like to learn how to do this manually/via function rather than finding a "new cases" dataset as I will be working with timeseries a lot in the very near future.

Solution

The diff function is the right tool, but look at the error message from your first attempt:

'DatetimeIndexResampler' object has no attribute 'diff'

diff is available on DataFrames, not on Resamplers, so turn the Resampler back into a DataFrame by specifying how you want to aggregate each resampled bin.
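To illustrate the distinction, here is a minimal sketch using the sample values from the question. The date index built with pd.date_range and the variable names are assumptions made for the example, and the frequency is written as '2D', which is the same two-day frequency as '2d'.

```python
import pandas as pd

# Sample cumulative case counts from the question, indexed by date
# so that the frame can be resampled.
df = pd.DataFrame(
    {"covid_cases": [1.2, 2.0, 2.9, 3.6, 3.9]},
    index=pd.date_range("2020-03-13", periods=5, freq="D"),
)

# resample() returns a Resampler: a lazy grouping into time bins,
# not a DataFrame, so it does not offer diff.
resampler = df.resample("2D")
has_diff = hasattr(resampler, "diff")

# Aggregating each bin (here by taking its last row) yields an
# ordinary DataFrame again, and DataFrames do have diff.
aggregated = resampler.last()
is_frame = isinstance(aggregated, pd.DataFrame)

print(type(resampler).__name__, has_diff)
print(type(aggregated).__name__, is_frame)
```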

If you have the total number of COVID cases for each day and want to resample it to 2 days, you probably only want to keep the latest update out of the two days, in which case something like df.resample('2d').last().diff() should work.
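As a rough sketch of both cases on the question's sample data: the first part applies the resample-then-diff pattern from this answer, and the second part computes the daily new cases shown in the desired output. Grouping by state before calling diff is an assumption added here for data that holds several states; it is not part of the original answer.

```python
import pandas as pd

# Sample data from the question: cumulative cases for one state.
df = pd.DataFrame({
    "state": ["Alabama"] * 5,
    "date": pd.date_range("2020-03-13", periods=5, freq="D"),
    "covid_cases": [1.2, 2.0, 2.9, 3.6, 3.9],
})

# Two-day bins: keep the last cumulative total seen in each bin,
# then subtract the previous bin's total from each bin.
two_day = df.set_index("date")["covid_cases"].resample("2D").last().diff()

# Daily new cases: subtract the previous row within each state.
# Grouping by state (an assumption for multi-state data) stops the
# first row of one state being subtracted from the last row of another.
df = df.sort_values(["state", "date"])
df["new_covid_cases"] = df.groupby("state")["covid_cases"].diff()

print(two_day)
print(df)
```

On this sample the two-day series comes out as NaN, 1.6, 0.3 and the daily column as NaN, 0.8, 0.9, 0.7, 0.3, up to floating-point rounding. The first value is NaN in both cases because there is no earlier row to subtract.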
