I have a DataFrame that looks like this:
Year Month Equipment Weight
2017 1 TennisBall 5
2017 1 Football 4
2017 1 TennisBall 6
2017 1 TennisBall 7
2017 1 TennisBall 300
2017 2 TennisBall 300
2018 2 TennisBall 250
2018 2 Football 5
2018 2 TennisBall 6
2018 2 TennisBall 275
...
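In the example above, we normally ship tennis balls in units of 300 only in February, which makes the order of 6 an outlier there, whereas in January the normal amount is ~5, which makes 300 the outlier for that month. I'd like to remove outliers based on Weight on a per-month basis. Is there an easy way to do this? I know I can do something along the lines of: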
df1[np.abs(df1.Weight-df1.Weight.mean()) <= (5*df1.Weight.std())]
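to grab anything whose weight is within 5 standard deviations of the mean, but this doesn't take the per-month part into account, and I can see that the weight changes drastically depending on the month. Thanks!
Edit:
For example, the desired output would look like this: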
Year Month Equipment Weight
2017 1 TennisBall 5
2017 1 Football 4
2017 1 TennisBall 6
2017 1 TennisBall 7
2017 2 TennisBall 300
2018 2 TennisBall 250
2018 2 Football 5
2018 2 TennisBall 275
...
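The outlier of 300 in January is removed (it is outside the norm for January), and the outlier of 6 in February is removed (it would be normal in January, but it is abnormal for February).
Best answer
This is a job for groupby. You can solve it by creating two new columns, one holding each row's difference from its group mean and one holding the group's standard deviation, and then filtering on those columns: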
import numpy as np

# Difference between each Weight and the mean of its (Year, Month, Equipment) group
df['Weight diff'] = df['Weight'].sub(df.groupby(['Year','Month','Equipment'])['Weight'].transform('mean'))
# Standard deviation of each group (NaN for groups with a single row)
df['std'] = df.groupby(['Year','Month','Equipment'])['Weight'].transform('std')
# Keep rows that lie within one standard deviation of their group mean;
# the "or" condition retains rows from single-value groups, whose std is NaN
df = df.loc[(np.abs(df['Weight diff']) <= df['std']) | (df['std'].isnull())]
# Remove the helper columns
df = df.drop(['Weight diff', 'std'], axis=1)
>>> print(df)
    Year  Month   Equipment  Weight
1   2017      1  TennisBall       5
2   2017      1    Football       4
3   2017      1  TennisBall       6
4   2017      1  TennisBall       7
6   2017      2  TennisBall     300
7   2018      2  TennisBall     250
8   2018      2    Football       5
10  2018      2  TennisBall     275
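For reference, the same filter can be sketched as a small self-contained script that uses a boolean mask instead of temporary columns. The function name `drop_group_outliers` and its `n_std` parameter are hypothetical, introduced here only for illustration, and the DataFrame is rebuilt from the ten sample rows in the question with pandas' default zero-based index.

```python
import pandas as pd

# Sample rows copied from the question (default 0-based index).
df = pd.DataFrame({
    "Year": [2017, 2017, 2017, 2017, 2017, 2017, 2018, 2018, 2018, 2018],
    "Month": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    "Equipment": ["TennisBall", "Football", "TennisBall", "TennisBall",
                  "TennisBall", "TennisBall", "TennisBall", "Football",
                  "TennisBall", "TennisBall"],
    "Weight": [5, 4, 6, 7, 300, 300, 250, 5, 6, 275],
})


def drop_group_outliers(frame, keys, value="Weight", n_std=1.0):
    """Hypothetical helper: keep rows within n_std group standard
    deviations of their group mean."""
    grouped = frame.groupby(keys)[value]
    # Absolute distance of each row from its own group's mean
    deviation = (frame[value] - grouped.transform("mean")).abs()
    # Sample standard deviation per group; NaN when a group has one row
    std = grouped.transform("std")
    # Rows in single-row groups are kept because there is nothing to compare
    keep = (deviation <= n_std * std) | std.isna()
    return frame[keep]


result = drop_group_outliers(df, ["Year", "Month", "Equipment"])
print(result)
```

On this sample the January row with Weight 300 (label 4) and the February 2018 row with Weight 6 (label 8) are dropped. The threshold is one standard deviation, as in the answer's filter; the global example in the question used five, so `n_std` is worth tuning against real data.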