I'm trying to use the rolling() function on a pandas data frame with monthly data. However, I dropped some NaN values, so now there are some gaps in my time series. Therefore, the basic window parameter gives a misleading answer since it just looks at the previous observation:
import pandas as pd
import numpy as np
import random
dft = pd.DataFrame(np.random.randint(0,10,size=len(dt)),index=dt)
dft.columns = ['value']
dft['value'] = np.where(dft['value'] < 3,np.nan,dft['value'])
dft = dft.dropna()
dft['basic'] = dft['value'].rolling(2).sum()
See, for example, the 2017-08-31 entry, which sums 3.0 and 9.0 even though the previous entry, 2017-03-31, is five months earlier.
In [57]: dft.tail()
            value  basic
2017-02-28    8.0   12.0
2017-03-31    3.0   11.0
2017-08-31    9.0   12.0
2017-10-31    7.0   16.0
2017-11-30    7.0   14.0
The natural solution (I thought) is to use a '2M' offset, but it gives an error:
In [58]: dft['basic2M'] = dft['value'].rolling('2M').sum()
...<output omitted>...
ValueError: <2 * MonthEnds> is a non-fixed frequency
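A likely reason for this error is that time-based rolling windows need a span of fixed length, and a month is not one. The sketch below only illustrates that point; the dates are arbitrary examples, not taken from the data above.

```python
import pandas as pd

# A day-based window has a fixed length, so it converts to a Timedelta.
fixed_window = pd.Timedelta("32D")

# "Two months" covers a different number of days depending on where it starts.
two_months = pd.DateOffset(months=2)
start_feb = pd.Timestamp("2017-02-28")
start_jul = pd.Timestamp("2017-07-31")

span_feb = (start_feb + two_months) - start_feb  # 2017-02-28 to 2017-04-28
span_jul = (start_jul + two_months) - start_jul  # 2017-07-31 to 2017-09-30

print(fixed_window.days, span_feb.days, span_jul.days)
```

Because the two spans differ in length, a month-based offset cannot be reduced to a single window size.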
If I switch to a daily offset, I can get it to work, but this seems like an odd workaround:
In [59]: dft['basic32D'] = dft['value'].rolling('32D', min_periods=2).sum()
In [61]: dft.tail()
            value  basic  basic32D
2017-02-28    8.0   12.0      12.0
2017-03-31    3.0   11.0      11.0
2017-08-31    9.0   12.0       NaN
2017-10-31    7.0   16.0       NaN
2017-11-30    7.0   14.0      14.0
I also tried converting to a PeriodIndex:
dfp = dft.to_period(freq='M')
but this gives the same error:
dfp['basic2M'] = dfp['value'].rolling('2M').sum()
dfp['basic32Dp'] = dfp['value'].rolling('32D', min_periods=2).sum()
In [68]: dfp
         value  basic  basic32D  basic32Dp
2016-02    9.0    NaN       NaN        NaN
2016-03    3.0   12.0      12.0       12.0
2016-04    7.0   10.0      10.0       19.0
2016-05    3.0   10.0      10.0       22.0
2016-06    4.0    7.0       7.0       26.0
2016-07    7.0   11.0      11.0       33.0
2016-08    3.0   10.0      10.0       36.0
2016-09    9.0   12.0      12.0       45.0
2016-11    5.0   14.0       NaN       50.0
2017-01    4.0    9.0       NaN       54.0
2017-02    8.0   12.0      12.0       62.0
2017-03    3.0   11.0      11.0       65.0
2017-08    9.0   12.0       NaN       74.0
2017-10    7.0   16.0       NaN       81.0
2017-11    7.0   14.0      14.0       88.0
Perhaps the '32D' offset with the 'M' PeriodIndex is being treated as '32M'? The result appears to be just an expanding sum over the entire series.
Perhaps I'm misunderstanding how to use offsets? Obviously, I could solve this by keeping the NaN values in the original value column and just using the window parameter, but offsets seem quite useful.
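For reference, that NaN-preserving workaround could look roughly like the sketch below. It assumes month-end timestamps and uses made-up values that mirror the tail of the table above; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up monthly data with gaps (month-end timestamps).
idx = pd.to_datetime(
    ["2017-02-28", "2017-03-31", "2017-08-31", "2017-10-31", "2017-11-30"]
)
gapped = pd.Series([8.0, 3.0, 9.0, 7.0, 7.0], index=idx, name="value")

# Put the series back on a regular month-end grid; missing months become NaN.
full_idx = pd.date_range(
    gapped.index.min(), gapped.index.max(), freq=pd.offsets.MonthEnd()
)
regular = gapped.reindex(full_idx)

# A 2-row window now always covers two consecutive calendar months.
rolled = regular.rolling(2, min_periods=2).sum()

# Keep only the rows that exist in the gapped series.
result = rolled.loc[gapped.index]
print(result)
```

With `min_periods=2`, any row whose previous month is missing comes out as NaN instead of silently reaching back across the gap.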
For what it's worth, if I generate hourly data with a DatetimeIndex, things seem to work as expected (i.e. a '2D' offset with data every 12 hours gives the correct answer across missing rows).
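A small sketch of that fixed-frequency case, using hypothetical 12-hourly values with one day removed, shows how the time-based window differs from a count-based one:

```python
import pandas as pd

# Hypothetical 12-hourly data; both rows for 2017-01-03 are missing.
idx12 = pd.to_datetime([
    "2017-01-01 00:00", "2017-01-01 12:00",
    "2017-01-02 00:00", "2017-01-02 12:00",
    "2017-01-04 00:00", "2017-01-04 12:00",
])
s12 = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], index=idx12)

# Time-based window: only rows within the last two days of each timestamp.
by_time = s12.rolling("2D").sum()

# Count-based window: always the last four rows, regardless of the gap.
by_count = s12.rolling(4, min_periods=1).sum()

print(pd.DataFrame({"value": s12, "2D": by_time, "4 rows": by_count}))
```

At 2017-01-04 00:00 the '2D' window only picks up the rows after 2017-01-02 00:00, whereas the four-row window reaches back to 2017-01-01 12:00.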
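Here is a function that gives you monthly sums of a column after shifting the index back by a given number of days (note that it groups by calendar month rather than computing a moving window). You did not provide the variable 'dt' in your code above, so I created a list of datetimes (code included).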
from datetime import datetime
from dateutil.relativedelta import relativedelta
import pandas as pd
import numpy as np
import random
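def date_range(start_date, end_date, increment, period):
    # Build a list of datetimes from start_date to end_date (inclusive),
    # stepping by `increment` units of `period` (e.g. 8 'hours').
    result = []
    nxt = start_date
    delta = relativedelta(**{period: increment})
    while nxt <= end_date:
        result.append(nxt)
        nxt += delta
    return result

def MonthRollSum(df, offset, sumColumn):
    # df must have a DatetimeIndex
    df2 = df.copy()
    # shift every timestamp back by `offset` days
    df2.index = df2.index + pd.DateOffset(days=-offset)
    # sum sumColumn within each (year, month) of the shifted index
    return df2.groupby([df2.index.year, df2.index.month])[sumColumn].sum()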
# added this part to generate the dt list for 8hour interval for 1000 days
start_date = datetime.now()
end_date = start_date + relativedelta(days=1000)
end_date = end_date.replace(hour=19, minute=0, second=0, microsecond=0)
dt = date_range(start_date, end_date, 8, 'hours')
# the following was given by the questioner
dft = pd.DataFrame(np.random.randint(0,10,size=len(dt)),index=dt)
dft.columns = ['value']
dft['value'] = np.where(dft['value'] < 3,np.nan,dft['value'])
dft = dft.dropna()
# Call the solution function
dft = MonthRollSum(dft, 2, 'value')
The results may vary because the initial values are randomly generated:
2021  2     290.0
      3     379.0
      4     414.0
      5     368.0
      6     325.0
      7     405.0
      8     425.0
      9     380.0
      10    393.0
      11    370.0
      12    419.0
2022  1     377.0
      2     275.0
      3     334.0
      4     350.0
      5     395.0
      6     376.0
      7     420.0
      8     419.0
      9     359.0
      10    328.0
      11    394.0
      12    345.0
2023  1     381.0
      2     335.0
      3     352.0
      4     355.0
      5     376.0
      6     350.0
      7     401.0
      8     443.0
      9     394.0
      10    394.0
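If a trailing window measured in calendar months is what is needed for gapped monthly data like the question's, one possible alternative is sketched below. The helper name `trailing_month_sum` and its parameters are hypothetical, not part of pandas, and the sketch assumes a DatetimeIndex with at most one row per month.

```python
import numpy as np
import pandas as pd

def trailing_month_sum(s, months, min_periods=1):
    # Hypothetical helper: for each row, sum the values that fall in the
    # current calendar month or the (months - 1) calendar months before it.
    s = s.sort_index()
    values = s.to_numpy(dtype=float)
    # Number the calendar months consecutively so gaps are easy to measure.
    month_no = np.asarray(s.index.year) * 12 + np.asarray(s.index.month)
    out = np.full(len(s), np.nan)
    for i, m in enumerate(month_no):
        in_window = (month_no > m - months) & (month_no <= m)
        if in_window.sum() >= min_periods:
            out[i] = values[in_window].sum()
    return pd.Series(out, index=s.index, name=s.name)

# Made-up values mirroring the tail of the question's table.
idx = pd.to_datetime(
    ["2017-02-28", "2017-03-31", "2017-08-31", "2017-10-31", "2017-11-30"]
)
gapped = pd.Series([8.0, 3.0, 9.0, 7.0, 7.0], index=idx, name="value")

two_month = trailing_month_sum(gapped, months=2, min_periods=2)
print(two_month)
```

The loop is quadratic in the number of rows, which should be fine for monthly data but would need a vectorized approach for long series.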