Problem description
I am currently trying to clean up and fill in some missing time-series data using pandas. The interpolate function works quite well, but it lacks a few less widely used interpolation methods that I need for my data set. Two examples are a simple "last valid data point" fill, which would produce something akin to a step function, and a logarithmic or geometric interpolation.
Browsing through the docs, I could not find a way to pass a custom interpolation function. Does such functionality exist directly within pandas? If not, has anyone done any pandas-fu to efficiently apply custom interpolations by other means?
Recommended answer
The interpolation methods offered by pandas are those offered by scipy.interpolate.interp1d, which, unfortunately, do not seem to be extensible in any way. I had to do something similar to apply SLERP quaternion interpolation (using numpy-quaternion), and I managed to do it quite efficiently. I'll copy the code here in the hope that you can adapt it for your purposes:
import numpy as np
import quaternion  # provided by the numpy-quaternion package

def interpolate_slerp(data):
    if data.shape[1] != 4:
        raise ValueError('Need exactly 4 values for SLERP')
    vals = data.values.copy()
    # quaternions has shape (N,) (each quaternion is a scalar value)
    quaternions = quaternion.as_quat_array(vals)
    # This is a mask of the rows that contain NaN
    empty = np.any(np.isnan(vals), axis=1)
    # These are the positions of the valid values
    valid_loc = np.argwhere(~empty).squeeze(axis=-1)
    # These are the indices (e.g. time) of the valid values
    valid_index = data.index[valid_loc].values
    # These are the valid values
    valid_quaternions = quaternions[valid_loc]
    # Positions of the missing values
    empty_loc = np.argwhere(empty).squeeze(axis=-1)
    # Missing values before the first or after the last valid value are discarded
    empty_loc = empty_loc[(empty_loc > valid_loc.min()) & (empty_loc < valid_loc.max())]
    # Index value for missing values
    empty_index = data.index[empty_loc].values
    # Important bit! This tells you which valid values must be used as
    # interpolation ends for each missing value
    interp_loc_end = np.searchsorted(valid_loc, empty_loc)
    interp_loc_start = interp_loc_end - 1
    # These are the actual values of the interpolation ends
    interp_q_start = valid_quaternions[interp_loc_start]
    interp_q_end = valid_quaternions[interp_loc_end]
    # And these are the indices (e.g. time) of the interpolation ends
    interp_t_start = valid_index[interp_loc_start]
    interp_t_end = valid_index[interp_loc_end]
    # This performs the actual interpolation
    # For each missing value, you have:
    #   * Initial interpolation value
    #   * Final interpolation value
    #   * Initial interpolation index
    #   * Final interpolation index
    #   * Missing value index
    interpolated = quaternion.slerp(interp_q_start, interp_q_end, interp_t_start, interp_t_end, empty_index)
    # This puts the interpolated values into place
    data = data.copy()
    data.iloc[empty_loc] = quaternion.as_float_array(interpolated)
    return data
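To see the bracketing step of interpolate_slerp in isolation, here is a minimal sketch on a toy one-dimensional array. It uses plain NumPy only, and the data and variable names are made up for illustration:

```python
import numpy as np

# Hypothetical toy data: NaN marks a missing sample.
vals = np.array([1.0, np.nan, np.nan, 4.0, np.nan, 6.0, np.nan])

empty = np.isnan(vals)
valid_loc = np.flatnonzero(~empty)   # positions of valid samples
empty_loc = np.flatnonzero(empty)    # positions of missing samples

# Keep only the gaps that have a valid neighbour on both sides.
inside = (empty_loc > valid_loc.min()) & (empty_loc < valid_loc.max())
empty_loc = empty_loc[inside]

# For each gap, searchsorted returns the slot (within valid_loc) of the
# first valid position to its right; the slot before it is the left end.
end = np.searchsorted(valid_loc, empty_loc)
start = end - 1

left = valid_loc[start]    # position of the valid sample left of each gap
right = valid_loc[end]     # position of the valid sample right of each gap
print(empty_loc, left, right)
```

Because valid_loc is sorted, each lookup is a binary search, and the whole step is vectorised with no Python-level loop over the gaps.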
The trick is in np.searchsorted, which very quickly finds the right interpolation ends for each missing value. The limitations of this method are:
- Your interpolation function must work somewhat like quaternion.slerp (which should not be strange, since it has regular ufunc broadcasting behaviour).
- It only works for interpolation methods that require just one value on each end, so something like a cubic interpolation would not work (not that you would need it, since that one is already provided).
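As a rough sketch of how the same approach might be adapted to the cases from the question, the following applies a user-supplied function to a one-dimensional Series. The names interpolate_custom, geometric and last_valid are hypothetical helpers invented for this example, not part of pandas, and the code assumes a sorted numeric index and (for the geometric case) strictly positive values:

```python
import numpy as np
import pandas as pd


def interpolate_custom(series, func):
    """Fill interior NaNs using func(y0, y1, t0, t1, t) (hypothetical helper)."""
    vals = series.to_numpy(dtype=float)
    index = series.index.to_numpy(dtype=float)  # assumes a numeric index
    empty = np.isnan(vals)
    valid_loc = np.flatnonzero(~empty)
    empty_loc = np.flatnonzero(empty)
    if valid_loc.size < 2:
        return series.copy()  # nothing to interpolate between
    # Gaps before the first or after the last valid value are left as NaN.
    empty_loc = empty_loc[(empty_loc > valid_loc[0]) & (empty_loc < valid_loc[-1])]
    # Same searchsorted trick as in the answer: find both ends of each gap.
    end = np.searchsorted(valid_loc, empty_loc)
    start = end - 1
    y0, y1 = vals[valid_loc[start]], vals[valid_loc[end]]
    t0, t1 = index[valid_loc[start]], index[valid_loc[end]]
    out = series.astype(float)  # astype returns a new Series
    out.iloc[empty_loc] = func(y0, y1, t0, t1, index[empty_loc])
    return out


def geometric(y0, y1, t0, t1, t):
    # Geometric interpolation; assumes y0 and y1 are strictly positive.
    w = (t - t0) / (t1 - t0)
    return y0 * (y1 / y0) ** w


def last_valid(y0, y1, t0, t1, t):
    # Step function: repeat the last valid value to the left of the gap.
    return y0


s = pd.Series([1.0, np.nan, 4.0, np.nan, np.nan, 32.0, np.nan],
              index=[0, 1, 2, 3, 4, 5, 6])
filled = interpolate_custom(s, geometric)
step = interpolate_custom(s, last_valid)
print(filled)
print(step)
```

For these two particular cases pandas may already be enough: s.ffill() gives a step function (it also fills trailing gaps, unlike the sketch above), and np.exp(np.log(s).interpolate(method="index")) gives a geometric interpolation for positive data. The generic version is only worth the trouble when the function cannot be expressed that way.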