Question
I am using Python Pandas for the first time. I have traffic data recorded at roughly 5-minute intervals in CSV format:
...
2015-01-04 08:29:05,271238
2015-01-04 08:34:05,329285
2015-01-04 08:39:05,-1
2015-01-04 08:44:05,260260
2015-01-04 08:49:05,263711
...
There are several issues:
- At some timestamps the data is missing (recorded as -1)
- Some entries are missing altogether (sometimes for 2-3 consecutive hours)
- The observation frequency is not exactly 5 minutes; it occasionally drifts by a few seconds
I would like to obtain a regular time series, with an entry exactly every 5 minutes and no missing values. I have successfully interpolated the time series to approximate the -1 values with this code:
ts = pd.Series(values, index=timestamps)  # pd.TimeSeries no longer exists in current pandas; pd.Series is equivalent here
ts.interpolate(method='cubic', downcast='infer')
How can I both interpolate and regularize the frequency of the observations? Thank you all for the help.
Recommended answer
Change the -1 values to NaN:
ts[ts==-1] = np.nan
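For completeness, here is a minimal sketch of loading the data and marking the -1 sentinel as missing. It assumes the file has no header row; the column names `timestamp` and `traffic` are hypothetical, and the in-memory string stands in for the real file path. It uses `where` instead of in-place assignment, which avoids writing NaN into an integer column.

```python
import io
import pandas as pd

# Sample rows from the question; a real file path would replace the StringIO object.
csv_text = """2015-01-04 08:29:05,271238
2015-01-04 08:34:05,329285
2015-01-04 08:39:05,-1
2015-01-04 08:44:05,260260
2015-01-04 08:49:05,263711
"""

df = pd.read_csv(io.StringIO(csv_text), header=None,
                 names=['timestamp', 'traffic'],  # hypothetical column names
                 parse_dates=['timestamp'], index_col='timestamp')

# where() keeps values that satisfy the condition and puts NaN elsewhere,
# upcasting the integer column to float in the process.
ts = df['traffic'].where(df['traffic'] != -1)
print(ts)
```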
Then resample the data to a 5-minute frequency:
ts = ts.resample('5min').mean()
Note that if two measurements fall within the same 5-minute bin, mean() averages them together, and bins with no measurements become NaN. (Old pandas versions averaged by default when resample was called on its own; current versions require an explicit aggregation such as .mean().)
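The binning behaviour can be checked on a small made-up series. The values and the extra 08:31:00 timestamp below are invented purely for illustration; they are not from the question's data.

```python
import pandas as pd

# Two observations (08:31:00 and 08:34:05) fall into the same 5-minute bin,
# and nothing is observed between 08:35 and 08:45.
idx = pd.to_datetime(['2015-01-04 08:29:05',
                      '2015-01-04 08:31:00',
                      '2015-01-04 08:34:05',
                      '2015-01-04 08:49:05'])
demo = pd.Series([10.0, 20.0, 30.0, 40.0], index=idx)

binned = demo.resample('5min').mean()
print(binned)
# Bins are labelled by their start time:
#   08:25 -> 10.0, 08:30 -> 25.0 (mean of 20 and 30),
#   08:35 and 08:40 -> NaN, 08:45 -> 40.0
```

Because each bin is labelled with its start time, an observation at 08:29:05 ends up at 08:25:00.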
Finally, you could linearly interpolate the time series according to the time:
ts = ts.interpolate(method='time')
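Resampling moves each observation to the start of its bin, which shifts values in time. If the exact timing matters, one possible alternative (a sketch, not part of the original answer) is to interpolate at the true observation times and then read the result off an exact 5-minute grid. The variable names `grid` and `regular` are made up for this example.

```python
import numpy as np
import pandas as pd

timestamps = pd.to_datetime(['2015-01-04 08:29:05',
                             '2015-01-04 08:34:05',
                             '2015-01-04 08:39:05',
                             '2015-01-04 08:44:05',
                             '2015-01-04 08:49:05'])
# The -1 from the question is already written as NaN here.
ts = pd.Series([271238, 329285, np.nan, 260260, 263711], index=timestamps)

# Exact 5-minute grid lying inside the observed time range.
grid = pd.date_range(ts.index.min().ceil('5min'),
                     ts.index.max().floor('5min'), freq='5min')

# Add the grid points to the index, interpolate linearly in time using the
# real observation timestamps, then keep only the grid points.
regular = (ts.reindex(ts.index.union(grid))
             .interpolate(method='time')
             .reindex(grid))
print(regular)
```

The grid is restricted to the observed range so that no value has to be extrapolated.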
Since your data already has roughly a 5-minute frequency, you might need to resample at a shorter frequency so that cubic or spline interpolation can smooth out the curve:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
values = [271238, 329285, -1, 260260, 263711]
timestamps = pd.to_datetime(['2015-01-04 08:29:05',
'2015-01-04 08:34:05',
'2015-01-04 08:39:05',
'2015-01-04 08:44:05',
'2015-01-04 08:49:05'])
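ts = pd.Series(values, index=timestamps, dtype=float)  # float dtype so NaN can be stored
ts[ts==-1] = np.nan  # mark the -1 sentinel as missing
ts = ts.resample('min').mean()  # upsample to 1-minute bins; empty bins become NaN
ts.interpolate(method='spline', order=3).plot()  # spline interpolation requires SciPy
ts.interpolate(method='time').plot()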
lines, labels = plt.gca().get_legend_handles_labels()
labels = ['spline', 'time']
plt.legend(lines, labels, loc='best')
plt.show()
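The question also mentions gaps of 2-3 hours. Interpolating across such long gaps may produce values that are hard to justify. One option, shown here as a sketch on invented values, is the `limit` parameter of `interpolate`, which caps how many consecutive NaNs are filled. Note that it fills the first `limit` values of a longer gap rather than skipping the gap entirely, so whether this is acceptable depends on the use case.

```python
import numpy as np
import pandas as pd

# Made-up series on a 5-minute grid with four consecutive missing values.
grid = pd.date_range('2015-01-04 08:00', periods=6, freq='5min')
gappy = pd.Series([1.0, np.nan, np.nan, np.nan, np.nan, 6.0], index=grid)

# Fill at most 2 consecutive NaNs; the rest of the gap stays NaN.
filled = gappy.interpolate(method='time', limit=2)
print(filled)
```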