Question
I have the following DataFrame, a weekly price time series with a PeriodIndex. Let's call it df:
                           timestamp        open        high         low       close        volume
timestamp
2009-02-01/2009-02-07  733442.166309  830.540773  832.586910  828.788627  830.706009  48401.952790
2009-02-08/2009-02-14  733449.166309  839.945279  841.763948  837.812232  839.742489  53429.330472
2009-02-15/2009-02-21  733456.245777  790.733108  792.399775  788.897523  790.549550  50671.887387
2009-02-22/2009-02-28  733463.166309  760.586910  762.640558  758.234979  760.428112  60565.506438
If I try to resample it with df.resample('30min').mean(), the data ends at 2009-02-22. I would like it to end at 2009-02-28, while still starting at 2009-02-01. How can I do that?
I suspect it has to do with the closed and label arguments of the resample function, but those are not very well explained in the docs.
Here is a snippet to reconstruct the DataFrame:
import pandas as pd
from pandas import Period
dikt={'volume': {Period('2009-02-01/2009-02-07', 'W-SAT'): 48401.952789699571, Period('2009-02-08/2009-02-14', 'W-SAT'): 53429.330472103007, Period('2009-02-15/2009-02-21', 'W-SAT'): 50671.887387387389, Period('2009-02-22/2009-02-28', 'W-SAT'): 60565.506437768243}, 'close': {Period('2009-02-01/2009-02-07', 'W-SAT'): 830.70600858369096, Period('2009-02-08/2009-02-14', 'W-SAT'): 839.74248927038627, Period('2009-02-15/2009-02-21', 'W-SAT'): 790.54954954954951, Period('2009-02-22/2009-02-28', 'W-SAT'): 760.42811158798281}, 'open': {Period('2009-02-01/2009-02-07', 'W-SAT'): 830.54077253218884, Period('2009-02-08/2009-02-14', 'W-SAT'): 839.94527896995703, Period('2009-02-15/2009-02-21', 'W-SAT'): 790.73310810810813, Period('2009-02-22/2009-02-28', 'W-SAT'): 760.58690987124464}, 'high': {Period('2009-02-01/2009-02-07', 'W-SAT'): 832.58690987124464, Period('2009-02-08/2009-02-14', 'W-SAT'): 841.76394849785413, Period('2009-02-15/2009-02-21', 'W-SAT'): 792.39977477477476, Period('2009-02-22/2009-02-28', 'W-SAT'): 762.64055793991417}, 'low': {Period('2009-02-01/2009-02-07', 'W-SAT'): 828.78862660944208, Period('2009-02-08/2009-02-14', 'W-SAT'): 837.8122317596567, Period('2009-02-15/2009-02-21', 'W-SAT'): 788.89752252252254, Period('2009-02-22/2009-02-28', 'W-SAT'): 758.23497854077254}, 'timestamp': {Period('2009-02-01/2009-02-07', 'W-SAT'): 733442.16630901292, Period('2009-02-08/2009-02-14', 'W-SAT'): 733449.16630901292, Period('2009-02-15/2009-02-21', 'W-SAT'): 733456.24577702698, Period('2009-02-22/2009-02-28', 'W-SAT'): 733463.16630901292}}
df = pd.DataFrame(dikt, columns=['timestamp', 'open', 'high', 'low', 'close', 'volume'])
Answer
Since you want to include both the start_time of the first period in the PeriodIndex and the end_time of the last one, the keyword arguments of DataFrame.resample are of little help here: they are mutually exclusive in effect (changing any of them moves either the start_time or the end_time, but not both).
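To make the two "edges" concrete, here is a minimal sketch that rebuilds only the weekly PeriodIndex from the question (using pd.period_range rather than the full dict, which is my shortcut) and reads off the two boundary timestamps. The names idx, first_start and last_end are just illustrative.

```python
import pandas as pd

# Rebuild only the weekly PeriodIndex from the question; the values are not needed here.
idx = pd.period_range("2009-02-01", periods=4, freq="W-SAT")

# Each Period knows the first and last instant it covers.
first_start = idx[0].start_time   # start of the week 2009-02-01/2009-02-07
last_end = idx[-1].end_time       # end of the week 2009-02-22/2009-02-28

print(first_start)
print(last_end.normalize())       # the calendar day on which the last period ends
```

These are the two timestamps the desired output should span.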
Instead, you could first upsample to daily frequency, "D", and then take the mean of each 30-minute group:
df.resample('D').asfreq().resample('30min').mean()
The convention argument could have been used if you specifically wanted to resample from either the start_time or the end_time alone.
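As a rough analogy (this is not the resample call itself), the same start/end choice shows up when converting a single Period to a finer frequency with Period.asfreq and its how argument. The variable names below are made up for the illustration.

```python
import pandas as pd

# The first week from the question: 2009-02-01/2009-02-07.
week = pd.Period("2009-02-01", freq="W-SAT")

# Converting to daily frequency forces a choice of which edge of the week to keep.
day_at_start = week.asfreq("D", how="start")
day_at_end = week.asfreq("D", how="end")

print(day_at_start, day_at_end)
```

You get one edge or the other, never both, which is the limitation described above.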
To check:
resamp_start = df.resample('30min').mean()
resamp_all = df.resample('D').asfreq().resample('30min').mean().head(resamp_start.shape[0])
resamp_start.equals(resamp_all)
True
If you need only the resampled index and not its aggregation, it makes sense to upsample from the current frequency to the smallest unit of the target frequency (here, 1 minute) and then take every 30th row, which gives one sample per 30 minutes.
df.resample('min').asfreq().iloc[::30]
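The slicing step on its own can be sketched on a plain minute-frequency Series (a toy DatetimeIndex I made up for the example, not the question's data): taking every 30th row of minute data leaves rows spaced 30 minutes apart.

```python
import pandas as pd

# Toy data: two hours of one-minute samples, numbered 0..119.
minutes = pd.Series(
    range(120),
    index=pd.date_range("2009-02-01", periods=120, freq="min"),
)

# Keep every 30th row; no aggregation happens, rows are simply picked.
every_30 = minutes.iloc[::30]

print(every_30)
```

Because rows are picked rather than averaged, the values at the kept timestamps are unchanged.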
This gives you samples for the whole of 2009-02-28, whereas the earlier approach only considered dates up to but not including 2009-02-28, because the .resample('D') step normalizes the times (adjusts them to midnight).
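For comparison, here is a different route to the same span, not the method used above: build the 30-minute grid explicitly from the first period's start_time to the last period's end_time, then align the weekly rows to it. This is only a sketch under my own assumptions: it uses just the close column with rounded values, and it forward-fills each week's value across the week, which may or may not be what you want (drop method="ffill" to leave the gaps as NaN).

```python
import pandas as pd

# Assumption: a cut-down version of the question's frame, 'close' column only.
idx = pd.period_range("2009-02-01", periods=4, freq="W-SAT")
weekly = pd.DataFrame(
    {"close": [830.706009, 839.742489, 790.549550, 760.428112]}, index=idx
)

# 30-minute grid from the start of the first week to the end of the last week.
grid = pd.date_range(idx[0].start_time, idx[-1].end_time, freq="30min")

# Stamp each week at its start, then carry its value forward over the grid.
upsampled = weekly.to_timestamp(how="start").reindex(grid, method="ffill")

print(upsampled.index[0], upsampled.index[-1])
```

The grid starts at midnight on 2009-02-01 and its last slot falls on 2009-02-28, so both edges are covered in one pass.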