Problem description
I have a pandas DataFrame with a few columns.
I know that certain rows are outliers based on the value in a particular column.
For instance, the column 'Vol' has all of its values around 12.xx and one value of 4000.
I would like to exclude the rows whose 'Vol' value is an outlier like this.
So essentially I need a filter that selects all rows where the value of a given column is within, say, 3 standard deviations of the mean.
What is an elegant way to achieve this?
Use boolean indexing, as you would with a numpy.array:
import numpy as np
import pandas as pd
df = pd.DataFrame({'Data': np.random.normal(size=200)})  # example dataset of normally distributed data
df[np.abs(df.Data - df.Data.mean()) <= (3 * df.Data.std())]  # keep only the rows within 3 standard deviations of the mean in column 'Data'
df[~(np.abs(df.Data - df.Data.mean()) > (3 * df.Data.std()))]  # or, if you prefer it the other way around
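The two masks select the same rows only when the column has no missing values. Here is a minimal sketch of the difference, using a small made-up column (the values and the names deviation, limit, kept_le and kept_not_gt are illustrative, not from the question):

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value.
df = pd.DataFrame({'Data': [1.0, 2.0, np.nan, 3.0, 2.0]})

# mean() and std() skip NaN by default, so the threshold is still well defined.
deviation = np.abs(df.Data - df.Data.mean())
limit = 3 * df.Data.std()

# NaN <= limit evaluates to False, so the NaN row is dropped.
kept_le = df[deviation <= limit]

# NaN > limit also evaluates to False; negating it gives True, so the NaN row is kept.
kept_not_gt = df[~(deviation > limit)]

print(len(kept_le), len(kept_not_gt))  # 4 5
```

So the choice between the two forms may come down to whether rows with a missing value should survive the filter.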
For a Series, it is similar:
S = pd.Series(np.random.normal(size=200))
S[~((S - S.mean()).abs() > 3 * S.std())]
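Applied to the situation in the question, the same idea might look like the sketch below. The data is invented (200 values near 12 plus one value of 4000), and the helper name within_n_std and its n_std parameter are hypothetical, not part of pandas:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, only so the sketch is repeatable

# Hypothetical data: 200 values near 12 plus a single extreme value of 4000.
vol = np.append(rng.normal(loc=12.0, scale=0.5, size=200), 4000.0)
df = pd.DataFrame({'Vol': vol})

def within_n_std(frame, column, n_std=3.0):
    """Return the rows whose `column` value lies within n_std standard deviations of the column mean."""
    col = frame[column]
    return frame[(col - col.mean()).abs() <= n_std * col.std()]

filtered = within_n_std(df, 'Vol')
print(len(df), len(filtered))  # 201 200
```

Note that the outlier itself inflates both the mean and the standard deviation used for the threshold. With a couple of hundred rows this does not matter here, but in a very small sample a single extreme value may never be more than 3 sample standard deviations from the mean, so the filter would not remove it.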