This article covers how to remove outliers by group in pandas.

Problem description

I want to remove outliers group-wise, based on each group's 99th percentile value.

import pandas as pd
df = pd.DataFrame({'Group': ['A','A','A','B','B','B','B'], 'count': [1.1,11.2,1.1,3.3,3.40,3.3,100.0]})

In the output, I want to remove 11.2 from group A and 100 from group B, so the final dataset will contain only 5 observations:

wantdf = pd.DataFrame({'Group': ['A','A','B','B','B'], 'count': [1.1,1.1,3.3,3.40,3.3]})

I have tried the following, but I'm not getting the desired result:

df[df.groupby("Group")['count'].transform(lambda x : (x<x.quantile(0.99))&(x>(x.quantile(0.01)))).eq(1)]

Recommended answer

I don't think you want to filter on the lower quantile, because that excludes your lowest values. Here is the 0.01 quantile of each group:

import pandas as pd
df = pd.DataFrame({'Group': ['A','A','A','B','B','B','B'], 'count': [1.1,11.2,1.1,3.3,3.40,3.3,100.0]})
print(pd.DataFrame(df.groupby('Group').quantile(.01)['count']))

Output:

       count
Group
A        1.1
B        3.3

Those aren't outliers, right? They are the ordinary low values of each group, and the strict comparison x > x.quantile(0.01) drops them, so you wouldn't want to exclude them.
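If the goal is only to trim the high end, one option is to drop the lower-bound condition and compare each row against its group's 99th percentile alone. The following is a minimal sketch under that assumption; the names `upper` and `trimmed` are illustrative and not part of the original answer. Note that with pandas' default linear interpolation, a strict `<` comparison always removes each group's maximum, so this is a trim rather than true outlier detection.

```python
import pandas as pd

df = pd.DataFrame({
    'Group': ['A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'count': [1.1, 11.2, 1.1, 3.3, 3.40, 3.3, 100.0],
})

# Broadcast each group's 99th percentile back onto that group's rows
upper = df.groupby('Group')['count'].transform(lambda x: x.quantile(0.99))

# Keep only the rows strictly below their group's upper bound
trimmed = df[df['count'] < upper]
print(trimmed)
```

On this sample data the rows with 11.2 and 100.0 are removed, leaving the five observations asked for in the question.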

You could instead try setting left and right limits at one standard deviation on either side of each group's median. This is a bit verbose, but it gives the right answer for this data:

# Per-group limits: median minus/plus one standard deviation of 'count'
grouped = df.groupby('Group')['count']
left = (grouped.median() - grouped.std()).to_frame('left')
right = (grouped.median() + grouped.std()).to_frame('right')

# Attach each row's group limits as helper columns
df = df.merge(left, left_on='Group', right_index=True)
df = df.merge(right, left_on='Group', right_index=True)

# Keep rows strictly inside the limits, then drop the helper columns
df = df[(df['count'] > df['left']) & (df['count'] < df['right'])]
df = df.drop(['left', 'right'], axis=1)
print(df)

Output:

  Group  count
0     A    1.1
2     A    1.1
3     B    3.3
4     B    3.4
5     B    3.3
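The same median-and-standard-deviation filter can probably be written without the merges by using `transform`, which returns per-group statistics aligned to the original rows. This is a sketch rather than the answerer's code: the helper name `drop_group_outliers` and its `n_std` parameter are hypothetical, introduced here only for illustration.

```python
import pandas as pd


def drop_group_outliers(frame, group_col, value_col, n_std=1.0):
    """Keep rows strictly within n_std standard deviations of the group median.

    Hypothetical helper: the name and the n_std parameter are assumptions.
    """
    grouped = frame.groupby(group_col)[value_col]
    center = grouped.transform('median')  # group median, aligned to each row
    spread = grouped.transform('std')     # group sample standard deviation
    lower = center - n_std * spread
    upper = center + n_std * spread
    return frame[(frame[value_col] > lower) & (frame[value_col] < upper)]


df = pd.DataFrame({
    'Group': ['A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'count': [1.1, 11.2, 1.1, 3.3, 3.40, 3.3, 100.0],
})

result = drop_group_outliers(df, 'Group', 'count')
print(result)
```

One caveat with this design: a group with a single row has an undefined (NaN) standard deviation, so both comparisons evaluate to False and that row is dropped. Whether that is acceptable depends on the data.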


08-22 20:17