Detecting and excluding outliers in a pandas DataFrame

Problem description

I have a pandas DataFrame with a few columns.

I know that certain rows are outliers based on the value in a certain column.

For instance, the column 'Vol' has all values around 12.xx and one value that is 4000.

I would like to exclude the rows whose 'Vol' value is an outlier like this.

So essentially I need a filter that selects all rows where the values of a certain column are within, say, 3 standard deviations of the mean.

What's an elegant way to achieve this?

Solution

Use boolean indexing, as you would with a numpy.array:

import numpy as np
import pandas as pd
df = pd.DataFrame({'Data': np.random.normal(size=200)})  # example dataset of normally distributed data
df[np.abs(df.Data - df.Data.mean()) <= (3 * df.Data.std())]  # keep only the rows within 3 standard deviations of the mean of column 'Data'
df[~(np.abs(df.Data - df.Data.mean()) > (3 * df.Data.std()))]  # or negate the outlier mask (unlike the line above, this keeps NaN rows)
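
Applied to the case in the question, a minimal sketch might look like the following. The column name 'Vol' comes from the question; the 200 synthetic rows, the random seed, and the position of the 4000 value are assumptions made only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 200 values near 12.xx, as in the question (seed is arbitrary).
rng = np.random.default_rng(0)
df = pd.DataFrame({'Vol': rng.normal(loc=12.5, scale=0.2, size=200)})
df.loc[50, 'Vol'] = 4000.0  # plant a single extreme value

# Boolean mask: True where 'Vol' lies within 3 standard deviations of the mean.
within = (df['Vol'] - df['Vol'].mean()).abs() <= 3 * df['Vol'].std()

filtered = df[within]    # rows to keep
outliers = df[~within]   # rows that were excluded
print(len(filtered), len(outliers))
```

Note that the extreme value itself inflates both the mean and the standard deviation. With ten or fewer rows, no single value can lie more than 3 sample standard deviations from the mean, so this filter would remove nothing on a very small frame.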

For a Series it is similar:

S = pd.Series(np.random.normal(size=200))
S[~((S - S.mean()).abs() > 3 * S.std())]
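
If the same filter is needed in several places, the mask can be wrapped in a small helper. This is only a sketch: the function name `within_n_std` and its `n_std` parameter are hypothetical, not part of pandas, and the sample values are made up.

```python
import pandas as pd

def within_n_std(values: pd.Series, n_std: float = 3.0) -> pd.Series:
    """Return a boolean mask that is True where a value lies within
    n_std standard deviations of the mean of `values`."""
    return (values - values.mean()).abs() <= n_std * values.std()

# Made-up sample: forty ordinary values and one extreme value.
s = pd.Series([10.0] * 20 + [11.0] * 20 + [500.0])
mask = within_n_std(s)
kept = s[mask]      # values inside the band
flagged = s[~mask]  # values detected as outliers
```

The same helper works on a DataFrame column, for example `df[within_n_std(df['Data'])]`, because the mask shares the frame's index.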
