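I have a time series, but the data has a discontinuity (the row for 2005-03-02 02:08:00 is missing).
I need a new column C such that C(i) = A(i) + B(i) + average, where average is the mean of B over the rows up to the discontinuity (02:08:00):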
average=Data.loc['2005-03-02 02:05:30':'2005-03-02 02:07:30',['B']].mean(axis=0)
After the discontinuity, the average has to be recalculated over the rows up to the next discontinuity:
average=Data.loc['2005-03-02 02:08:30':'2005-03-02 02:11:00',['B']].mean(axis=0)
Input
Date,A,B
2005-03-02 02:05:30,1,3
2005-03-02 02:06:00,2,4
2005-03-02 02:06:30,3,5
2005-03-02 02:07:00,4,6
2005-03-02 02:07:30,5,7
2005-03-02 02:08:30,7,9
2005-03-02 02:09:00,7,9
2005-03-02 02:09:30,7,9
2005-03-02 02:10:00,8,12
2005-03-02 02:10:30,9,13
2005-03-02 02:11:00,10,14
Output
Date,A,B,C
2005-03-02 02:05:30,1,3,9
2005-03-02 02:06:00,2,4,11
2005-03-02 02:06:30,3,5,13
2005-03-02 02:07:00,4,6,15
2005-03-02 02:07:30,5,7,17
2005-03-02 02:08:30,7,9,28
2005-03-02 02:09:00,7,9,28
2005-03-02 02:09:30,7,9,28
2005-03-02 02:10:00,8,12,32
2005-03-02 02:10:30,9,13,34
2005-03-02 02:11:00,10,14,36
How can I find the discontinuities in the index?
How can I do all of this with pandas?
Best answer
Step 1: Read the data into a DataFrame
import pandas as pd
from io import StringIO
y = '''Date,A,B
2005-03-02 02:05:30,1,3
2005-03-02 02:06:00,2,4
2005-03-02 02:06:30,3,5
2005-03-02 02:07:00,4,6
2005-03-02 02:07:30,5,7
2005-03-02 02:08:30,7,9
2005-03-02 02:09:00,7,9
2005-03-02 02:09:30,7,9
2005-03-02 02:10:00,8,12
2005-03-02 02:10:30,9,13
2005-03-02 02:11:00,10,14'''
df = pd.read_csv(StringIO(y), index_col='Date')
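Step 2: Convert the index to a DatetimeIndex and resample at 30-second intervals, so that every missing timestamp shows up as a row of NaN
df.index = pd.to_datetime(df.index)
new = df.resample('30s').mean()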
Output:
                        A     B
Date
2005-03-02 02:05:30   1.0   3.0
2005-03-02 02:06:00   2.0   4.0
2005-03-02 02:06:30   3.0   5.0
2005-03-02 02:07:00   4.0   6.0
2005-03-02 02:07:30   5.0   7.0
2005-03-02 02:08:00   NaN   NaN
2005-03-02 02:08:30   7.0   9.0
2005-03-02 02:09:00   7.0   9.0
2005-03-02 02:09:30   7.0   9.0
2005-03-02 02:10:00   8.0  12.0
2005-03-02 02:10:30   9.0  13.0
2005-03-02 02:11:00  10.0  14.0
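To answer the "how do I find the discontinuities" part directly, the all-NaN rows of the resampled frame can be pulled out as a list of missing timestamps. The following is a minimal sketch that rebuilds the sample data so it runs on its own; the names csv_text and missing are illustrative and not part of the original answer.

```python
import pandas as pd
from io import StringIO

# Sample data from the question (the 02:08:00 row is absent).
csv_text = """Date,A,B
2005-03-02 02:05:30,1,3
2005-03-02 02:06:00,2,4
2005-03-02 02:06:30,3,5
2005-03-02 02:07:00,4,6
2005-03-02 02:07:30,5,7
2005-03-02 02:08:30,7,9
2005-03-02 02:09:00,7,9
2005-03-02 02:09:30,7,9
2005-03-02 02:10:00,8,12
2005-03-02 02:10:30,9,13
2005-03-02 02:11:00,10,14"""

df = pd.read_csv(StringIO(csv_text), index_col="Date", parse_dates=True)

# Resampling onto a regular 30-second grid creates an empty bin for every
# missing timestamp; the mean of an empty bin is NaN.
new = df.resample("30s").mean()

# Timestamps whose row is NaN in every column are the discontinuities.
missing = new.index[new.isnull().all(axis=1)]
print(missing)
```

For the sample data this should report the single missing timestamp 2005-03-02 02:08:00.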
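Step 3: Split the DataFrame at the all-NaN rows and assign each segment a group ID (the cumulative sum increases by one at every NaN row)
new["group_no"] = new.isnull().all(axis=1).cumsum()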
Output:
                        A     B  group_no
Date
2005-03-02 02:05:30   1.0   3.0         0
2005-03-02 02:06:00   2.0   4.0         0
2005-03-02 02:06:30   3.0   5.0         0
2005-03-02 02:07:00   4.0   6.0         0
2005-03-02 02:07:30   5.0   7.0         0
2005-03-02 02:08:00   NaN   NaN         1
2005-03-02 02:08:30   7.0   9.0         1
2005-03-02 02:09:00   7.0   9.0         1
2005-03-02 02:09:30   7.0   9.0         1
2005-03-02 02:10:00   8.0  12.0         1
2005-03-02 02:10:30   9.0  13.0         1
2005-03-02 02:11:00  10.0  14.0         1
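The group IDs come from taking a cumulative sum over a boolean mask: each True (a gap row) bumps the running total by one, so every row after it lands in a new group. A small standalone illustration on a made-up mask with two gaps (the mask values are hypothetical, not taken from the question's data):

```python
import pandas as pd

# Hypothetical mask: True marks a gap row, False marks a normal row.
is_gap = pd.Series([False, False, True, False, False, True, False])

# cumsum treats True as 1 and False as 0, so the running total steps up
# at each gap row and stays constant within a segment.
group_no = is_gap.cumsum()
print(group_no.tolist())  # [0, 0, 1, 1, 1, 2, 2]
```

Note that the gap row itself is counted in the segment that follows it. That is harmless here because its values are NaN and are skipped when the mean is computed.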
Step 4: Compute the mean of B within each group_no and broadcast it back to every row
new['Bmean'] = new.groupby('group_no')['B'].transform('mean')
Output:
                        A     B  group_no  Bmean
Date
2005-03-02 02:05:30   1.0   3.0         0    5.0
2005-03-02 02:06:00   2.0   4.0         0    5.0
2005-03-02 02:06:30   3.0   5.0         0    5.0
2005-03-02 02:07:00   4.0   6.0         0    5.0
2005-03-02 02:07:30   5.0   7.0         0    5.0
2005-03-02 02:08:00   NaN   NaN         1   11.0
2005-03-02 02:08:30   7.0   9.0         1   11.0
2005-03-02 02:09:00   7.0   9.0         1   11.0
2005-03-02 02:09:30   7.0   9.0         1   11.0
2005-03-02 02:10:00   8.0  12.0         1   11.0
2005-03-02 02:10:30   9.0  13.0         1   11.0
2005-03-02 02:11:00  10.0  14.0         1   11.0
Step 5: Compute C and drop the helper columns
new['C'] = new['A'] + new['B'] + new['Bmean']
new = new.drop(columns=['group_no', 'Bmean'])
Output:
                        A     B     C
Date
2005-03-02 02:05:30   1.0   3.0   9.0
2005-03-02 02:06:00   2.0   4.0  11.0
2005-03-02 02:06:30   3.0   5.0  13.0
2005-03-02 02:07:00   4.0   6.0  15.0
2005-03-02 02:07:30   5.0   7.0  17.0
2005-03-02 02:08:00   NaN   NaN   NaN
2005-03-02 02:08:30   7.0   9.0  27.0
2005-03-02 02:09:00   7.0   9.0  27.0
2005-03-02 02:09:30   7.0   9.0  27.0
2005-03-02 02:10:00   8.0  12.0  31.0
2005-03-02 02:10:30   9.0  13.0  33.0
2005-03-02 02:11:00  10.0  14.0  35.0
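The result above still contains the NaN row that resampling introduced. If that row is unwanted, one possible alternative (a sketch, not the method used in the answer) is to skip resampling and detect gaps from the spacing between consecutive timestamps. It assumes the nominal sampling interval is 30 seconds; the names csv_text, step, gaps, and group_no are illustrative.

```python
import pandas as pd
from io import StringIO

# Sample data from the question (the 02:08:00 row is absent).
csv_text = """Date,A,B
2005-03-02 02:05:30,1,3
2005-03-02 02:06:00,2,4
2005-03-02 02:06:30,3,5
2005-03-02 02:07:00,4,6
2005-03-02 02:07:30,5,7
2005-03-02 02:08:30,7,9
2005-03-02 02:09:00,7,9
2005-03-02 02:09:30,7,9
2005-03-02 02:10:00,8,12
2005-03-02 02:10:30,9,13
2005-03-02 02:11:00,10,14"""

df = pd.read_csv(StringIO(csv_text), index_col="Date", parse_dates=True)

# Assumption: samples are nominally 30 seconds apart.
step = pd.Timedelta("30s")

# True on the first row after each gap; the first row's difference is NaT,
# which compares as False.
gaps = df.index.to_series().diff() > step

# Running count of gaps seen so far serves as the segment ID.
group_no = gaps.cumsum().rename("group_no")

# Per-segment mean of B, broadcast back to every row of its segment.
df["C"] = df["A"] + df["B"] + df.groupby(group_no)["B"].transform("mean")
print(df)
```

This keeps the original 11 rows and produces the same C values as Step 5 (9.0 through 17.0 for the first segment, 27.0 through 35.0 for the second). Calling new.dropna() on the resampled result would be another way to get rid of the placeholder row.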