Problem description
I have the following question: I have a pandas dataframe in which missing values are marked by the string na. I want to run an Imputer on it to replace the missing values with the mean of the column. According to the sklearn documentation, the parameter missing_values should help me with this:
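missing_values : integer or "NaN", optional (default="NaN"). The placeholder for the missing values. All occurrences of missing_values will be imputed. For missing values encoded as np.nan, use the string value "NaN".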
As I understand it, this means that if I write
df = pd.read_csv(filename)
imp = Imputer(missing_values='na')
imp.fit_transform(df)
the imputer will replace every na value in the dataframe with the mean of its column. Instead, I get an error:
ValueError: could not convert string to float: na
What am I misinterpreting? Is this not how the imputer should work? How can I replace the na strings with the mean, then? Should I just use a lambda for it?
Thanks!
Recommended answer
Since you say you want to replace these 'na' values with the mean of the column, I'm guessing the non-missing values are indeed floats. The problem is that pandas does not recognize the string 'na' as a missing value, and so reads the column with dtype object instead of some flavor of float.
Case in point, consider the following .csv file:
test.csv
col1,col2
1.0,1.0
2.0,2.0
3.0,3.0
na,4.0
5.0,5.0
With the naive import df = pd.read_csv('test.csv'), df.dtypes tells us that col1 is of dtype object and col2 is of dtype float64. But how do you take the mean of a bunch of objects?
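To see the problem directly, here is a minimal sketch that loads the same data from an in-memory string rather than from test.csv (the CSV_TEXT constant and the variable names are only illustrative assumptions). It checks whether each column was parsed as numeric; the exact dtype name reported for col1 may differ between pandas versions, but in either case it is not a float column.

```python
import io

import pandas as pd

# Same contents as test.csv above, kept in memory so the snippet is self-contained.
CSV_TEXT = "col1,col2\n1.0,1.0\n2.0,2.0\n3.0,3.0\nna,4.0\n5.0,5.0\n"

# Naive import: the lowercase string 'na' is not treated as a missing-value marker.
naive = pd.read_csv(io.StringIO(CSV_TEXT))

# col1 contains the string 'na', so pandas keeps the whole column as strings.
col1_is_numeric = pd.api.types.is_numeric_dtype(naive["col1"])
col2_is_numeric = pd.api.types.is_numeric_dtype(naive["col2"])

print(naive.dtypes)
print("col1 numeric:", col1_is_numeric)
print("col2 numeric:", col2_is_numeric)
print(naive["col1"].tolist())
```

Because col1 holds strings, any step that needs floats (such as computing a mean inside the imputer) fails when it reaches 'na'.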
The solution is to tell pd.read_csv() to interpret the string 'na' as a missing value:
df = pd.read_csv('test.csv', na_values='na')
The resulting dataframe has both columns of dtype float64, and you can now use your imputer.
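As a rough end-to-end sketch, the snippet below reads the same data with na_values='na' and then fills the gap with the column mean. Note the assumptions: the data is loaded from an in-memory string (the hypothetical CSV_TEXT constant) instead of test.csv, and pandas' own fillna is used as a stand-in for the sklearn imputer, so it illustrates the effect of mean imputation rather than the Imputer class itself.

```python
import io

import pandas as pd

# Same contents as test.csv above, kept in memory so the snippet is self-contained.
CSV_TEXT = "col1,col2\n1.0,1.0\n2.0,2.0\n3.0,3.0\nna,4.0\n5.0,5.0\n"

# na_values='na' tells pandas to parse the string 'na' as NaN.
df = pd.read_csv(io.StringIO(CSV_TEXT), na_values="na")
print(df.dtypes)

# Stand-in for the imputer: replace each NaN with the mean of its column.
# df.mean() skips NaN values, so col1's mean is taken over 1.0, 2.0, 3.0 and 5.0.
imputed = df.fillna(df.mean())
print(imputed)
```

With both columns parsed as floats, the sklearn imputer can be applied to df in the same way. In more recent scikit-learn releases the Imputer class is no longer available, and sklearn.impute.SimpleImputer with strategy='mean' fills the same role.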