python - Unable to find the outliers (more specifically, the IQR) of my dataset

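I am trying to find the outliers of an Excel sheet using pandas in Python. I can find the first and third quartiles, but I cannot subtract one from the other without getting NaN back.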

Here is the basic code:

absent = pd.read_excel('Absenteeism_at_work.xls')

print("\nOUTLIERS:")
# q1 = (absent.loc[:741, ['Distance from Residence to Work']].quantile([0.25]))
# q3 = (absent.loc[:741, ['Distance from Residence to Work']].quantile([0.75]))

#print(absent.loc[:741, 'Distance from Residence to Work'].quantile([0.25])) #quartile

#print(q1)
# q1, q3 = absent.loc[:741, ['Distance from Residence to Work', 'Transportation expense', 'Month of absence',
  #                       'Social smoker', 'Social drinker', 'Education']].quantile([0.25 - 0.75])

print(absent.loc[:741, ['Distance from Residence to Work', 'Transportation expense', 'Month of absence',
                      'Social smoker', 'Social drinker', 'Education']].quantile([0.75])
   - absent.loc[:741, ['Distance from Residence to Work', 'Transportation expense', 'Month of absence',
                    'Social smoker', 'Social drinker', 'Education']].quantile([0.25]))

Output:

OUTLIERS:
      Distance from Residence to Work  Transportation expense  \
0.25                              NaN                     NaN
0.75                              NaN                     NaN

      Month of absence  Social smoker  Social drinker  Education
0.25               NaN            NaN             NaN        NaN
0.75               NaN            NaN             NaN        NaN

Best answer

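Your code is just a simple interquartile range (IQR) calculation. If that works for you, great. If you need real outlier detection, which is much more involved than a quartile-based rule (especially for multivariate data), you can use Python packages such as sklearn or pyod.
To use the quantile function, you need to clean the original data so that it contains only numeric values, especially since you are importing an Excel file as the data source.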
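One way to do that cleaning is sketched below. The sample values are made up for illustration, and `pd.to_numeric` with `errors="coerce"` is just one option: it turns anything that cannot be parsed as a number into NaN, which `quantile` then skips.

```python
import pandas as pd

# Made-up stand-in for a column read from Excel that contains stray text.
raw = pd.DataFrame({"Transportation expense": [118, "179", "n/a", 260, None]})

# Convert every column to numeric; unparseable entries become NaN.
clean = raw.apply(pd.to_numeric, errors="coerce")

print(clean.dtypes)
print(clean["Transportation expense"].quantile(0.25))
```

Whether coercing to NaN is appropriate depends on the data; dropping or correcting the offending rows may be a better fit in some cases.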
You can check the data like this:

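tmp_df = absent.iloc[:741]

cols = ['Distance from Residence to Work',
        'Transportation expense',
        'Month of absence',
        'Social smoker',
        'Social drinker',
        'Education']

# First and third quartiles for each selected column
print(tmp_df[cols].quantile([0.25, 0.75]))

# Summary statistics; non-numeric columns show up here with count/unique/top
print(tmp_df[cols].describe(include='all'))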
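The NaN values in the question most likely come from index alignment rather than from the data itself: `quantile([0.75])` returns a DataFrame whose row label is 0.75, `quantile([0.25])` returns one labelled 0.25, and pandas subtracts by matching labels, so no rows line up. Passing a scalar instead returns a Series indexed by column name, and the subtraction works. The sketch below uses made-up data, and the 1.5 × IQR fences (Tukey's rule) are an assumption about what "outlier" should mean here.

```python
import pandas as pd

# Made-up sample; the real data would come from pd.read_excel(...).
df = pd.DataFrame({
    "Distance from Residence to Work": [5, 10, 12, 14, 16, 18, 20, 52],
    "Transportation expense": [118, 155, 179, 179, 225, 235, 260, 388],
})

# List argument: results are labelled 0.75 and 0.25, so the rows never match.
misaligned = df.quantile([0.75]) - df.quantile([0.25])

# Scalar argument: each result is a Series indexed by column name.
q1 = df.quantile(0.25)
q3 = df.quantile(0.75)
iqr = q3 - q1

# Assumed rule: flag values more than 1.5 * IQR outside the quartiles.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = df[(df < lower) | (df > upper)].dropna(how="all")

print(misaligned)  # all NaN, as in the question
print(iqr)
print(outliers)    # rows with at least one flagged value
```

If the list form is preferred, selecting the rows explicitly with `.loc[0.75]` and `.loc[0.25]` before subtracting gives the same result as the scalar form.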

Good luck.
