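I'm trying to find outliers in an Excel worksheet using pandas in Python. I can find the first and third quartiles, but I can't subtract one from the other without getting NaN back.
Here is the basic code: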
import pandas as pd

absent = pd.read_excel('Absenteeism_at_work.xls')
print("\nOUTLIERS:")
# q1 = (absent.loc[:741, ['Distance from Residence to Work']].quantile([0.25]))
# q3 = (absent.loc[:741, ['Distance from Residence to Work']].quantile([0.75]))
#print(absent.loc[:741, 'Distance from Residence to Work'].quantile([0.25])) #quartile
#print(q1)
# q1, q3 = absent.loc[:741, ['Distance from Residence to Work', 'Transportation expense', 'Month of absence',
# 'Social smoker', 'Social drinker', 'Education']].quantile([0.25 - 0.75])
print(absent.loc[:741, ['Distance from Residence to Work', 'Transportation expense', 'Month of absence',
'Social smoker', 'Social drinker', 'Education']].quantile([0.75])
- absent.loc[:741, ['Distance from Residence to Work', 'Transportation expense', 'Month of absence',
'Social smoker', 'Social drinker', 'Education']].quantile([0.25]))
Output:
OUTLIERS:
Distance from Residence to Work Transportation expense \
0.25 NaN NaN
0.75 NaN NaN
Month of absence Social smoker Social drinker Education
0.25 NaN NaN NaN NaN
0.75 NaN NaN NaN NaN
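Best answer
Your code is just a simple interquartile range calculation. If that works for you, great. If you need real outlier detection, which is far more involved than a quartile-based approach (especially in the multivariate case), you can use Python packages such as sklearn or pyod.
To use the quantile function, you need to clean the raw data so that it contains only numeric values, particularly since you are importing an Excel file as the data source.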
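The clean-up step described above might look like the following minimal sketch. The small DataFrame is made up for illustration (it is not the real spreadsheet), and `pd.to_numeric` with `errors="coerce"` is just one possible way to force columns to numbers:

```python
import pandas as pd

# Hypothetical stand-in for two columns of the spreadsheet; the stray
# text value mimics an Excel cell that is not purely numeric.
raw = pd.DataFrame({
    "Distance from Residence to Work": [36, 13, "n/a", 51, 5],
    "Transportation expense": [289, 118, 179, 279, 361],
})

# Coerce every column to numbers; anything unparseable becomes NaN.
numeric = raw.apply(pd.to_numeric, errors="coerce")

print(numeric.dtypes)
print(numeric.quantile([0.25, 0.75]))
```

Values that become NaN are skipped by `quantile`, so it is worth counting them (for example with `numeric.isna().sum()`) before trusting the result.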
Check the data with:
tmp_df = absent.iloc[:741]
cols = ['Distance from Residence to Work',
        'Transportation expense',
        'Month of absence',
        'Social smoker',
        'Social drinker',
        'Education']
print(tmp_df[cols].quantile([0.25, 0.75]))
print(tmp_df[cols].describe(include='all'))
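Separately from data cleaning, the all-NaN output in the question is consistent with pandas index alignment: `quantile([0.75])` returns a DataFrame indexed by 0.75 and `quantile([0.25])` one indexed by 0.25, so subtracting them finds no matching row labels. The sketch below reproduces that on made-up data (not the real spreadsheet) and shows one possible workaround, passing scalars so the results are Series indexed by column name. The 1.5 × IQR cutoff is the common rule of thumb, assumed here rather than taken from the question:

```python
import pandas as pd

# Hypothetical numeric data standing in for the spreadsheet columns.
df = pd.DataFrame({
    "Distance from Residence to Work": [5, 10, 15, 20, 25, 30, 35, 40, 200],
    "Transportation expense": [100, 110, 120, 130, 140, 150, 160, 170, 180],
})

# List argument -> DataFrame indexed by the quantile (0.25 or 0.75).
# The two indexes share no label, so the subtraction is all NaN.
misaligned = df.quantile([0.75]) - df.quantile([0.25])

# Scalar argument -> Series indexed by column name, which lines up.
q1 = df.quantile(0.25)
q3 = df.quantile(0.75)
iqr = q3 - q1

# Rule of thumb: flag values more than 1.5 * IQR outside the quartiles.
outliers = (df < q1 - 1.5 * iqr) | (df > q3 + 1.5 * iqr)

print(misaligned)
print(iqr)
print(df[outliers.any(axis=1)])
```

If the list form is preferred, selecting the rows explicitly, as in `q.loc[0.75] - q.loc[0.25]` for `q = df.quantile([0.25, 0.75])`, should avoid the misalignment in the same way.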
Good luck.
怀