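I'm trying to find outliers in an Excel worksheet using pandas in Python. I can find the first and third quartiles, but I can't subtract one from the other without getting NaN.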

Here is the basic code:

import pandas as pd

absent = pd.read_excel('Absenteeism_at_work.xls')

print("\nOUTLIERS:")
# q1 = (absent.loc[:741, ['Distance from Residence to Work']].quantile([0.25]))
# q3 = (absent.loc[:741, ['Distance from Residence to Work']].quantile([0.75]))

#print(absent.loc[:741, 'Distance from Residence to Work'].quantile([0.25])) #quartile

#print(q1)
# q1, q3 = absent.loc[:741, ['Distance from Residence to Work', 'Transportation expense', 'Month of absence',
  #                       'Social smoker', 'Social drinker', 'Education']].quantile([0.25 - 0.75])

print(absent.loc[:741, ['Distance from Residence to Work', 'Transportation expense', 'Month of absence',
                      'Social smoker', 'Social drinker', 'Education']].quantile([0.75])
   - absent.loc[:741, ['Distance from Residence to Work', 'Transportation expense', 'Month of absence',
                    'Social smoker', 'Social drinker', 'Education']].quantile([0.25]))


Output:

OUTLIERS:
      Distance from Residence to Work  Transportation expense  \
0.25                              NaN                     NaN
0.75                              NaN                     NaN

      Month of absence  Social smoker  Social drinker  Education
0.25               NaN            NaN             NaN        NaN
0.75               NaN            NaN             NaN        NaN

Best answer

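Your code is just a simple interquartile range (IQR) calculation. If that is enough for you, fine. If you need real outlier detection, which is far more involved than a quartile-based approach (especially for multivariate data), you can use a Python package such as sklearn or pyod.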
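As a rough illustration of the quartile-based approach, here is a minimal sketch on made-up data (the values below are hypothetical stand-ins, not taken from Absenteeism_at_work.xls). It also shows one likely reason the subtraction in the question yields NaN: `quantile([0.75])` and `quantile([0.25])` return DataFrames whose row labels are `0.75` and `0.25`, and pandas aligns on those labels before subtracting. Passing a scalar returns a Series indexed by column name, which subtracts cleanly. The 1.5 × IQR cutoff is the common Tukey convention, assumed here rather than taken from the question.

```python
import pandas as pd

# Hypothetical stand-in data; the real values come from the Excel file.
df = pd.DataFrame({
    'Distance from Residence to Work': [5, 10, 12, 14, 16, 18, 20, 52],
    'Transportation expense': [118, 179, 225, 235, 246, 260, 289, 1000],
})

# A scalar q returns a Series indexed by column name, so q3 - q1 lines up.
q1 = df.quantile(0.25)
q3 = df.quantile(0.75)
iqr = q3 - q1

# A list q returns a DataFrame indexed by the quantile value (0.75 vs 0.25).
# The row labels never match, so the aligned difference is all NaN.
misaligned = df.quantile([0.75]) - df.quantile([0.25])

# Tukey's rule (an assumed convention): flag values more than
# 1.5 * IQR below Q1 or above Q3, column by column.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
is_outlier = df.lt(lower, axis='columns') | df.gt(upper, axis='columns')

print(iqr)
print(misaligned)
print(df[is_outlier.any(axis=1)])
```

If you want to keep the list form, `df.quantile([0.25, 0.75])` followed by `.loc[0.75] - .loc[0.25]` on the result gives the same IQR.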
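To use the quantile function, you need to clean the raw data so that it contains only numeric values, particularly since you are importing an Excel file as the data source.
Check the data with: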

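tmp_df = absent.iloc[:741]

cols = ['Distance from Residence to Work',
        'Transportation expense',
        'Month of absence',
        'Social smoker',
        'Social drinker',
        'Education']

# Pass both quantiles as one list; quantile(0.25, 0.75) would treat 0.75 as the axis argument
print(tmp_df[cols].quantile([0.25, 0.75]))

# Summary statistics reveal non-numeric or unexpected columns
print(tmp_df[cols].describe(include='all'))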
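If the `describe` output shows columns that were read as text, one possible cleanup is to coerce them with `pd.to_numeric`. The sketch below uses a small hypothetical frame (the `'n/a'` cell and the string-typed numbers are invented to mimic a messy Excel import), so treat it as an assumption about what the real file might contain rather than a description of it.

```python
import pandas as pd

# Hypothetical frame mimicking a messy Excel import: numbers stored as
# text plus one stray non-numeric cell.
raw = pd.DataFrame({
    'Distance from Residence to Work': ['36', '13', '51', 'n/a'],
    'Education': [1, 1, 3, 2],
})

# Coerce every column to numeric; cells that cannot be parsed become NaN.
clean = raw.apply(pd.to_numeric, errors='coerce')

print(clean.dtypes)

# quantile skips NaN by default, so the coerced column is now usable.
quartiles = clean.quantile([0.25, 0.75])
print(quartiles)
```

Using `errors='coerce'` silently turns bad cells into NaN, so it is worth counting them with `clean.isna().sum()` before trusting the quartiles.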


Good luck.

怀

10-02 09:24