Problem description
Using scipy.stats.linregress, I am performing a simple linear regression on several sets of highly correlated x,y experimental data, and so far I have been visually inspecting each x,y scatter plot for outliers. More generally, is there a programmatic way to identify and mask outliers?
Recommended answer
The statsmodels package has what you need. Look at this short code snippet and its output:
# Imports #
import statsmodels.api as smapi
import statsmodels.graphics as smgraphics
# Make data #
x = list(range(30))  # a list, so insert() works on Python 3
y = [v*10 for v in x]
# Add outlier #
x.insert(6,15)
y.insert(6,220)
# Make graph #
regression = smapi.OLS(y, x).fit()  # OLS takes (endog, exog): response y first, then x
figure = smgraphics.regressionplots.plot_fit(regression, 0)
# Find outliers #
test = regression.outlier_test()  # each row: studentized residual, unadjusted p, Bonferroni p
outliers = ((x[i],y[i]) for i,t in enumerate(test) if t[2] < 0.5)
print('Outliers:', list(outliers))
Outliers: [(15, 220)]
With newer versions of statsmodels, things have changed a bit. Here is a new code snippet that shows the same type of outlier detection.
# Imports #
from random import random
import statsmodels.api as smapi
from statsmodels.formula.api import ols
import statsmodels.graphics as smgraphics
# Make data #
x = list(range(30))  # a list, so insert() works on Python 3
y = [v*(10+random())+200 for v in x]
# Add outlier #
x.insert(6,15)
y.insert(6,220)
# Make fit #
regression = ols("data ~ x", data=dict(data=y, x=x)).fit()
# Find outliers #
test = regression.outlier_test()  # DataFrame; column 2 holds the Bonferroni-adjusted p-value
outliers = ((x[i],y[i]) for i,t in enumerate(test.iloc[:, 2]) if t < 0.5)
print('Outliers:', list(outliers))
# Figure #
figure = smgraphics.regressionplots.plot_fit(regression, 1)
# Add line #
smgraphics.regressionplots.abline_plot(model_results=regression, ax=figure.axes[0])
Outliers: [(15, 220)]