Problem description
I have a dataset with three columns in a Python notebook. There seem to be too many outliers beyond 1.5 times the IQR. How can I count the outliers for all columns?
If there are too many outliers, I may consider removing the points that are outliers for more than one feature. If so, how can I count them that way?
Thanks!
Similar to Romain X.'s answer, but this operates on the whole DataFrame instead of a Series.
Random data:
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randn(100, 5), columns=list('ABCDE'))
df.iloc[::10] += np.random.randn() * 2  # shift every 10th row by one random offset; this hopefully introduces some outliers
df.head()
Out:
          A         B         C         D         E
0  2.529517  1.165622  1.744203  3.006358  2.633023
1 -0.977278  0.950088 -0.151357 -0.103219  0.410599
2  0.144044  1.454274  0.761038  0.121675  0.443863
3  0.333674  1.494079 -0.205158  0.313068 -0.854096
4 -2.552990  0.653619  0.864436 -0.742165  2.269755
Quartile calculations:
Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = Q3 - Q1
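As a small worked example of these formulas, the sketch below computes the quartiles and the 1.5 * IQR fences for a tiny hand-made frame. The `toy` data and the names `lower` and `upper` are illustrative assumptions and are not part of the original answer.

```python
import pandas as pd

# Hypothetical toy data: one column with a single obvious outlier.
toy = pd.DataFrame({"x": [1, 2, 3, 4, 100]})

Q1 = toy.quantile(0.25)   # per-column first quartile (2.0 for "x")
Q3 = toy.quantile(0.75)   # per-column third quartile (4.0 for "x")
IQR = Q3 - Q1             # interquartile range (2.0 for "x")

lower = Q1 - 1.5 * IQR    # values below this fence count as outliers
upper = Q3 + 1.5 * IQR    # values above this fence count as outliers

print(lower["x"], upper["x"])  # -1.0 7.0
```

Only the value 100 falls outside the interval from -1 to 7, so it is the single outlier in this toy column.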
These are the outlier counts for each column:
((df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR))).sum()
Out:
A    1
B    0
C    0
D    1
E    2
dtype: int64
These counts are in line with seaborn's calculations.
Note that the part before the sum, ((df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR))), is a boolean mask, so you can use it directly to remove outliers. For example, this sets them to NaN:
mask = (df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR))
df[mask] = np.nan
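For the second part of the question (points that are outliers for more than one feature), one possible approach is to sum the same kind of boolean mask across columns instead of down the rows. This is only a sketch under assumptions: the helper name `iqr_outlier_mask`, its `k` parameter, and the toy data are hypothetical, and it should be run on the original values, before any outliers have been overwritten with NaN.

```python
import pandas as pd

def iqr_outlier_mask(df, k=1.5):
    """Boolean frame: True where a value lies outside the k * IQR fences of its column."""
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    return (df < (q1 - k * iqr)) | (df > (q3 + k * iqr))

# Hypothetical toy data: row 4 is extreme in both columns, row 0 only in "b".
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 100],
    "b": [-50, 11, 12, 13, 90],
})

mask = iqr_outlier_mask(toy)
per_row = mask.sum(axis=1)   # number of columns in which each row is an outlier
multi = per_row > 1          # True for rows flagged in more than one column

print(int(multi.sum()))      # 1 (only row 4 is an outlier in both columns)
cleaned = toy[~multi]        # keep every row except those flagged more than once
```

Filtering with `toy[~multi]` returns a new frame and leaves the original data untouched, which makes it easier to compare the row counts before and after.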