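I have a pandas DataFrame with a few columns of speed values that change continuously. Because this is sensor data, we often get erroneous points in the middle of the series, and a moving average does not seem to help. What methods can I use to remove these outliers or spikes from the data?
Example:
data points = {0.5,0.5,0.7,0.6,0.5,0.7,0.5,0.4,0.6,4,0.5,0.5,4,5,6,0.4,0.7,0.8,0.9}
In this data, the points 4, 4, 5 and 6 are clearly outliers.
I previously smoothed these values with a rolling mean over a 5-minute window, but I still get many points of this kind that I want to remove. Can anyone suggest a technique for getting rid of them?
Here is a clearer view of the data:
As you can see, the data shows some outlier points that I have to remove.
Any idea how to get rid of them?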
Best answer
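I really think a z-score computed with scipy.stats.zscore() is the way to go here. Have a look at the related question in this post. There, the focus is mainly on which method to use before removing potential outliers. As I see it, your challenge is a bit simpler: judging by the data provided, identifying potential outliers is very straightforward and does not require transforming the data. Below is a code snippet that does just that. Keep in mind, though, that what counts as an outlier depends entirely on your dataset. And once some outliers have been removed, values that did not look like outliers before may suddenly start to. Take a look: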
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from scipy import stats

# your data (as a list)
data = [0.5,0.5,0.7,0.6,0.5,0.7,0.5,0.4,0.6,4,0.5,0.5,4,5,6,0.4,0.7,0.8,0.9]

# initial plot
df1 = pd.DataFrame(data = data)
df1.columns = ['data']
df1.plot(style = 'o')

# Function to identify and remove outliers
def outliers(df, level):
    # 1. work on a copy of the dataframe that was passed in,
    #    so the caller's data is left untouched
    df = df.copy(deep = True)

    # 2. keep only the rows whose absolute Z-score is below `level`
    #    in every column
    df_Z = df[(np.abs(stats.zscore(df)) < level).all(axis=1)]
    ix_keep = df_Z.index

    # 3. subset the raw dataframe with the indexes you'd like to keep
    df_keep = df.loc[ix_keep]

    return df_keep
Original data:
Test run 1: Z-score = 4:
As you can see, no data has been removed because the level was set too high.
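To see why a level of 4 removes nothing, it can help to print the z-scores themselves. The following is a minimal sketch, not part of the original answer: it computes the population z-score by hand with pandas (using `ddof=0`, which is also the default of `scipy.stats.zscore`) and counts how many points each of the three levels used in these test runs would keep. The variable names `s`, `z` and `kept` are made up for illustration.

```python
import pandas as pd

data = [0.5, 0.5, 0.7, 0.6, 0.5, 0.7, 0.5, 0.4, 0.6, 4,
        0.5, 0.5, 4, 5, 6, 0.4, 0.7, 0.8, 0.9]
s = pd.Series(data, name="data")

# Population z-score: distance from the mean in units of the standard deviation.
# ddof=0 matches the default used by scipy.stats.zscore.
z = (s - s.mean()) / s.std(ddof=0)

# The largest absolute z-score tells you the highest level that removes anything.
print("largest |z|:", round(z.abs().max(), 2))

# Number of points that survive a single pass at each level.
kept = {level: int((z.abs() < level).sum()) for level in (4, 2, 1.2)}
for level, n in kept.items():
    print(f"level {level}: keeps {n} of {len(s)} points")
```

On this sample the largest absolute z-score is roughly 2.6 (the value 6), so any level above that leaves the data unchanged. The spikes inflate both the mean and the standard deviation, which is why even obvious outliers score fairly low.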
Test run 2: Z-score = 2:
Now we're getting somewhere. Two outliers have been removed, but there is still some dubious data left.
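Test run 3: Z-score = 1.2:
This is looking really good. The remaining data now seems to be more evenly distributed than before. But the data point highlighted against the original data is now starting to look a bit like a potential outlier itself. So where do you stop? That is entirely up to you!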
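The "where to stop" question becomes concrete if the filter is applied repeatedly, recomputing the mean and standard deviation after each pass. The sketch below is an illustration only, not the method used in the test runs above: `zscore_filter` and `zscore_filter_repeated` are hypothetical helper names, and the z-score is computed by hand with `ddof=0` to match the default of `scipy.stats.zscore`.

```python
import pandas as pd

def zscore_filter(df, level):
    """One pass: keep rows whose |z-score| is below `level` in every column."""
    z = (df - df.mean()) / df.std(ddof=0)
    return df[(z.abs() < level).all(axis=1)]

def zscore_filter_repeated(df, level, max_passes=10):
    """Hypothetical helper: repeat the one-pass filter until a pass removes nothing.

    Returns the filtered dataframe and the row count after each pass.
    """
    history = [len(df)]
    for _ in range(max_passes):
        filtered = zscore_filter(df, level)
        if len(filtered) == len(df):
            break  # nothing was removed, so further passes change nothing
        df = filtered
        history.append(len(df))
    return df, history

data = [0.5, 0.5, 0.7, 0.6, 0.5, 0.7, 0.5, 0.4, 0.6, 4,
        0.5, 0.5, 4, 5, 6, 0.4, 0.7, 0.8, 0.9]
df1 = pd.DataFrame({"data": data})

cleaned, history = zscore_filter_repeated(df1, level=2)
print("rows after each pass:", history)
print("remaining values:", sorted(cleaned["data"].unique()))
```

With level 2, the first pass drops 5 and 6 and the second drops both 4s. Once the spikes are gone the standard deviation shrinks sharply, and later passes start discarding ordinary-looking readings such as 0.9. That is the effect described above, and it is a reason to cap the number of passes or to check each result by eye.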
Edit: here is the whole thing for an easy copy and paste:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from scipy import stats

# your data (as a list)
data = [0.5,0.5,0.7,0.6,0.5,0.7,0.5,0.4,0.6,4,0.5,0.5,4,5,6,0.4,0.7,0.8,0.9]

# initial plot
df1 = pd.DataFrame(data = data)
df1.columns = ['data']
df1.plot(style = 'o')

# Function to identify and remove outliers
def outliers(df, level):
    # 1. work on a copy of the dataframe that was passed in,
    #    so the caller's data is left untouched
    df = df.copy(deep = True)

    # 2. keep only the rows whose absolute Z-score is below `level`
    #    in every column
    df_Z = df[(np.abs(stats.zscore(df)) < level).all(axis=1)]
    ix_keep = df_Z.index

    # 3. subset the raw dataframe with the indexes you'd like to keep
    df_keep = df.loc[ix_keep]

    return df_keep

# remove outliers
level = 1.2
print("df_clean = outliers(df = df1, level = " + str(level)+')')
df_clean = outliers(df = df1, level = level)

# final plot
df_clean.plot(style = 'o')

# display both figures when running as a script
plt.show()