问题描述
在Pandas DataFrame中内插NaN单元非常容易:
It's very easy to interpolate NaN cells in a Pandas DataFrame:
In [98]: df
Out[98]:
neg neu pos avg
250 0.508475 0.527027 0.641292 0.558931
500 NaN NaN NaN NaN
1000 0.650000 0.571429 0.653983 0.625137
2000 NaN NaN NaN NaN
3000 0.619718 0.663158 0.665468 0.649448
4000 NaN NaN NaN NaN
6000 NaN NaN NaN NaN
8000 NaN NaN NaN NaN
10000 NaN NaN NaN NaN
20000 NaN NaN NaN NaN
30000 NaN NaN NaN NaN
50000 NaN NaN NaN NaN
[12 rows x 4 columns]
In [99]: df.interpolate(method='nearest', axis=0)
Out[99]:
neg neu pos avg
250 0.508475 0.527027 0.641292 0.558931
500 0.508475 0.527027 0.641292 0.558931
1000 0.650000 0.571429 0.653983 0.625137
2000 0.650000 0.571429 0.653983 0.625137
3000 0.619718 0.663158 0.665468 0.649448
4000 NaN NaN NaN NaN
6000 NaN NaN NaN NaN
8000 NaN NaN NaN NaN
10000 NaN NaN NaN NaN
20000 NaN NaN NaN NaN
30000 NaN NaN NaN NaN
50000 NaN NaN NaN NaN
[12 rows x 4 columns]
我还希望它使用给定的方法推断插值范围之外的NaN值.我该怎么做呢?
I would also want it to extrapolate the NaN values that are outside of the interpolation scope, using the given method. How could I best do this?
推荐答案
外推熊猫DataFrame
s
DataFrame
可能是外推的,但是,pandas中没有简单的方法调用,并且需要另一个库(例如 scipy.optimize ).
Extrapolating Pandas DataFrame
s
DataFrame
s maybe be extrapolated, however, there is not a simple method call within pandas and requires another library (e.g. scipy.optimize).
一般而言,外推法要求人们做出某些关于数据的假设 .一种方法是通过曲线拟合一些通用的参数化方程到数据,以找到最能描述该参数的参数值.现有数据,然后用于计算超出此数据范围的值.这种方法的困难和局限性问题是,在选择参数化方程式时,必须对趋势做出一些假设.可以通过使用不同方程式的反复试验来找到所需的结果,或者有时可以从数据源中推断出来.问题中提供的数据实际上不足以获取良好拟合曲线的数据集;但是,它足以说明问题.
Extrapolating, in general, requires one to make certain assumptions about the data being extrapolated. One way is by curve fitting some general parameterized equation to the data to find parameter values that best describe the existing data, which is then used to calculate values that extend beyond the range of this data. The difficult and limiting issue with this approach is that some assumption about trend must be made when the parameterized equation is selected. This can be found thru trial and error with different equations to give the desired result or it can sometimes be inferred from the source of the data. The data provided in the question is really not large enough of a dataset to obtain a well fit curve; however, it is good enough for illustration.
下面是使用3 阶多项式外推DataFrame
的示例
The following is an example of extrapolating the DataFrame
with a 3 order polynomial
此通用函数(func()
)可以通过曲线拟合到每一列上,以获得唯一的列特定参数(即 a , b , c , d ).然后,将这些参数化的方程式用于对所有具有NaN
s的索引的每一列中的数据进行推断.
This generic function (func()
) is curve fit onto each column to obtain unique column specific parameters (i.e. a, b, c, d). Then these parameterized equations are used to extrapolate the data in each column for all the indexes with NaN
s.
import pandas as pd
from cStringIO import StringIO
from scipy.optimize import curve_fit
df = pd.read_table(StringIO('''
neg neu pos avg
0 NaN NaN NaN NaN
250 0.508475 0.527027 0.641292 0.558931
500 NaN NaN NaN NaN
1000 0.650000 0.571429 0.653983 0.625137
2000 NaN NaN NaN NaN
3000 0.619718 0.663158 0.665468 0.649448
4000 NaN NaN NaN NaN
6000 NaN NaN NaN NaN
8000 NaN NaN NaN NaN
10000 NaN NaN NaN NaN
20000 NaN NaN NaN NaN
30000 NaN NaN NaN NaN
50000 NaN NaN NaN NaN'''), sep='\s+')
# Do the original interpolation
df.interpolate(method='nearest', xis=0, inplace=True)
# Display result
print ('Interpolated data:')
print (df)
print ()
# Function to curve fit to the data
def func(x, a, b, c, d):
return a * (x ** 3) + b * (x ** 2) + c * x + d
# Initial parameter guess, just to kick off the optimization
guess = (0.5, 0.5, 0.5, 0.5)
# Create copy of data to remove NaNs for curve fitting
fit_df = df.dropna()
# Place to store function parameters for each column
col_params = {}
# Curve fit each column
for col in fit_df.columns:
# Get x & y
x = fit_df.index.astype(float).values
y = fit_df[col].values
# Curve fit column and get curve parameters
params = curve_fit(func, x, y, guess)
# Store optimized parameters
col_params[col] = params[0]
# Extrapolate each column
for col in df.columns:
# Get the index values for NaNs in the column
x = df[pd.isnull(df[col])].index.astype(float).values
# Extrapolate those points with the fitted function
df[col][x] = func(x, *col_params[col])
# Display result
print ('Extrapolated data:')
print (df)
print ()
print ('Data was extrapolated with these column functions:')
for col in col_params:
print ('f_{}(x) = {:0.3e} x^3 + {:0.3e} x^2 + {:0.4f} x + {:0.4f}'.format(col, *col_params[col]))
外推结果
Interpolated data:
neg neu pos avg
0 NaN NaN NaN NaN
250 0.508475 0.527027 0.641292 0.558931
500 0.508475 0.527027 0.641292 0.558931
1000 0.650000 0.571429 0.653983 0.625137
2000 0.650000 0.571429 0.653983 0.625137
3000 0.619718 0.663158 0.665468 0.649448
4000 NaN NaN NaN NaN
6000 NaN NaN NaN NaN
8000 NaN NaN NaN NaN
10000 NaN NaN NaN NaN
20000 NaN NaN NaN NaN
30000 NaN NaN NaN NaN
50000 NaN NaN NaN NaN
Extrapolated data:
neg neu pos avg
0 0.411206 0.486983 0.631233 0.509807
250 0.508475 0.527027 0.641292 0.558931
500 0.508475 0.527027 0.641292 0.558931
1000 0.650000 0.571429 0.653983 0.625137
2000 0.650000 0.571429 0.653983 0.625137
3000 0.619718 0.663158 0.665468 0.649448
4000 0.621036 0.969232 0.708464 0.766245
6000 1.197762 2.799529 0.991552 1.662954
8000 3.281869 7.191776 1.702860 4.058855
10000 7.767992 15.272849 3.041316 8.694096
20000 97.540944 150.451269 26.103320 91.365599
30000 381.559069 546.881749 94.683310 341.042883
50000 1979.646859 2686.936912 467.861511 1711.489069
Data was extrapolated with these column functions:
f_neg(x) = 1.864e-11 x^3 + -1.471e-07 x^2 + 0.0003 x + 0.4112
f_neu(x) = 2.348e-11 x^3 + -1.023e-07 x^2 + 0.0002 x + 0.4870
f_avg(x) = 1.542e-11 x^3 + -9.016e-08 x^2 + 0.0002 x + 0.5098
f_pos(x) = 4.144e-12 x^3 + -2.107e-08 x^2 + 0.0000 x + 0.6312
avg
列的图
Plot for avg
column
如果没有更大的数据集或不知道数据源,则此结果可能完全错误,但应举例说明推断DataFrame
的过程. func()
中的假定方程可能需要进行播放才能获得正确的外推法.另外,也没有尝试使代码高效.
Without a larger dataset or knowing the source of the data, this result maybe completely wrong, but should exemplify the process to extrapolate a DataFrame
. The assumed equation in func()
would probably need to be played with to get the correct extrapolation. Also, no attempt to make the code efficient was made.
更新:
如果您的索引是非数字索引(例如DatetimeIndex
),请查看此答案以了解如何对其进行推断.
If your index is non-numeric, like a DatetimeIndex
, see this answer for how to extrapolate them.
这篇关于推断Pandas DataFrame中的值的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!