I have a DataFrame that looks like this:

Year Month Equipment   Weight
2017 1     TennisBall  5
2017 1     Football    4
2017 1     TennisBall  6
2017 1     TennisBall  7
2017 1     TennisBall  300
2017 2     TennisBall  300
2018 2     TennisBall  250
2018 2     Football    5
2018 2     TennisBall  6
2018 2     TennisBall  275
...


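In the example above, we normally ship around 300 units of tennis balls in February, which makes the 6-unit order an outlier there, whereas in January the normal quantity is ~5, so the 300-unit order is the outlier in that month. I want to remove outliers based on the weight within each month. Is there a simple way to do this? I know I can do something like: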

df1[np.abs(df1.Weight-df1.Weight.mean()) <= (5*df1.Weight.std())]


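to keep anything whose weight is within 5 standard deviations of the mean, but this does not take the per-month part into account, and I can see that the weight changes dramatically depending on the month. Thanks!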

Edit:
For example, the desired output would look like this:

Year Month Equipment   Weight
2017 1     TennisBall  5
2017 1     Football    4
2017 1     TennisBall  6
2017 1     TennisBall  7

2017 2     TennisBall  300
2018 2     TennisBall  250
2018 2     Football    5

2018 2     TennisBall  275
...


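The 300 in January is removed (it is outside the normal range for January), and the 6 in February is removed (it would be normal in January, but it is not normal for February).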

Best answer

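This is a job for groupby. You can solve it by adding two helper columns, one holding each row's deviation from its group mean and one holding the group's standard deviation, and then filtering on those columns: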

import numpy as np

# Difference between each Weight and the mean of its (Year, Month, Equipment) group
df['Weight diff'] = df['Weight'].sub(df.groupby(['Year','Month','Equipment'])['Weight'].transform('mean'))
# Standard deviation of each group (NaN for groups with a single row)
df['std'] = df.groupby(['Year','Month','Equipment'])['Weight'].transform('std')

# Keep rows within one group standard deviation of the group mean.
# The "or" condition keeps single-row groups, whose std is NaN.
df = df.loc[(np.abs(df['Weight diff']) <= df['std']) | (df['std'].isnull())]

# Remove the helper columns
df = df.drop(['Weight diff', 'std'], axis=1)




>>> print(df)

0   Year Month   Equipment  Weight
1   2017     1  TennisBall       5
2   2017     1    Football       4
3   2017     1  TennisBall       6
4   2017     1  TennisBall       7
6   2017     2  TennisBall     300
7   2018     2  TennisBall     250
8   2018     2    Football       5
10  2018     2  TennisBall     275
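
The same filter can also be written without temporary columns. The following is only a minimal sketch, not part of the original answer: the helper name `drop_group_outliers` and its `n_std` parameter are hypothetical, and the frame is rebuilt from the ten sample rows in the question (with a default 0-based index, so the row labels differ from the printout above).

```python
import pandas as pd

def drop_group_outliers(df, keys, col="Weight", n_std=1.0):
    """Hypothetical helper: keep rows within n_std group standard
    deviations of their group mean; single-row groups are kept."""
    grouped = df.groupby(keys)[col]
    mean = grouped.transform("mean")
    std = grouped.transform("std")  # NaN when a group has only one row
    keep = (df[col] - mean).abs().le(n_std * std) | std.isna()
    return df[keep]

# Sample rows copied from the question
df = pd.DataFrame({
    "Year": [2017] * 6 + [2018] * 4,
    "Month": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    "Equipment": ["TennisBall", "Football", "TennisBall", "TennisBall",
                  "TennisBall", "TennisBall", "TennisBall", "Football",
                  "TennisBall", "TennisBall"],
    "Weight": [5, 4, 6, 7, 300, 300, 250, 5, 6, 275],
})

result = drop_group_outliers(df, ["Year", "Month", "Equipment"])
print(result)
```

On this sample the January 300 and the 2018 February 6 are dropped, matching the desired output in the question. Passing a different `keys` list (for example only `["Month", "Equipment"]`) or a larger `n_std` changes how strict the filter is; which choice is appropriate depends on the real data.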
