Python - SkLearn Imputer usage

Question

I have the following question: I have a pandas dataframe in which missing values are marked by the string na. I want to run an Imputer on it to replace the missing values with the mean of the column. According to the sklearn documentation, the parameter missing_values should help me with this:

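missing_values : integer or "NaN", optional (default="NaN"). The placeholder for the missing values. All occurrences of missing_values will be imputed. For missing values encoded as np.nan, use the string value "NaN".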

In my understanding, this means that if I write

df = pd.read_csv(filename)
imp = Imputer(missing_values='na')
imp.fit_transform(df)

then the imputer replaces every na value in the dataframe with the mean of its column. Instead, I get an error:

ValueError: could not convert string to float: na

What am I misinterpreting? Is this not how the imputer should work? How can I replace the na strings with the mean, then? Should I just use a lambda for it?

Thanks!

Recommended answer

Since you say you want to replace these 'na' values with the mean of the column, I'm guessing the non-missing values are indeed floats. The problem is that pandas does not recognize the string 'na' as a missing value, so it reads the column with dtype object instead of some flavor of float. The imputer then fails when it tries to convert that column to float, which is the ValueError you see.

Case in point, consider the following .csv file:

 test.csv

 col1,col2
 1.0,1.0
 2.0,2.0
 3.0,3.0
 na,4.0
 5.0,5.0

With the naive import df = pd.read_csv('test.csv'), df.dtypes tells us that col1 is of dtype object and col2 is of dtype float64. But how do you take the mean of a bunch of objects?
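To see the dtype problem without creating a file, here is a minimal sketch that feeds the same five rows to pd.read_csv() through io.StringIO. The in-memory string and the variable names are assumptions made for illustration; the original example reads test.csv from disk. The exact dtype reported for col1 can depend on the pandas version, so the check below only asks whether the column is numeric.

```python
import io

import pandas as pd

# Same contents as test.csv above, held in memory (illustrative assumption).
csv_text = "col1,col2\n1.0,1.0\n2.0,2.0\n3.0,3.0\nna,4.0\n5.0,5.0\n"

# Naive import: lowercase 'na' is not one of pandas' default missing-value markers,
# so col1 is read as text rather than as floats.
naive = pd.read_csv(io.StringIO(csv_text))

print(naive.dtypes)

# col1 is not numeric, col2 is.
col1_is_numeric = pd.api.types.is_numeric_dtype(naive["col1"])
col2_is_numeric = pd.api.types.is_numeric_dtype(naive["col2"])
print(col1_is_numeric, col2_is_numeric)
```

Because col1 holds strings, there is no column mean for an imputer to compute.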

The solution is to tell pd.read_csv() to interpret the string 'na' as a missing value:

df = pd.read_csv('test.csv', na_values='na')

The resulting dataframe has both columns of dtype float64, with the 'na' entry stored as NaN, and you can now use your imputer with its default missing_values setting.
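Putting the pieces together, a minimal end-to-end sketch might look like the following. It assumes a recent scikit-learn, where the Imputer class used in the question has been removed and sklearn.impute.SimpleImputer fills the same role; that substitution, the in-memory CSV, and the variable names are not part of the original answer.

```python
import io

import pandas as pd
from sklearn.impute import SimpleImputer  # assumed replacement for the old Imputer

# Same contents as test.csv above, held in memory (illustrative assumption).
csv_text = "col1,col2\n1.0,1.0\n2.0,2.0\n3.0,3.0\nna,4.0\n5.0,5.0\n"

# Tell pandas that the string 'na' marks a missing value, so col1 becomes float64.
df = pd.read_csv(io.StringIO(csv_text), na_values="na")

# Mean imputation; missing_values defaults to np.nan, which is what pandas produced.
imp = SimpleImputer(strategy="mean")

# fit_transform returns a NumPy array, so wrap it to keep the column names.
imputed = pd.DataFrame(imp.fit_transform(df), columns=df.columns)

print(imputed)
```

The missing entry in col1 is filled with the mean of the remaining values, (1 + 2 + 3 + 5) / 4 = 2.75. If pulling in scikit-learn is not needed for anything else, df.fillna(df.mean()) gives the same result with pandas alone.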
