Question
Consider the following situation:
In [2]: a = pd.Series([1,2,3,4,'.'])
In [3]: a
Out[3]:
0 1
1 2
2 3
3 4
4 .
dtype: object
In [8]: a.astype('float64', raise_on_error = False)
Out[8]:
0 1
1 2
2 3
3 4
4 .
dtype: object
I would have expected an option that allows the conversion while turning erroneous values (such as that .) into NaNs. Is there a way to achieve this?
Recommended answer
Use pd.to_numeric with errors='coerce'.
# Setup
s = pd.Series(['1', '2', '3', '4', '.'])
s
0 1
1 2
2 3
3 4
4 .
dtype: object
pd.to_numeric(s, errors='coerce')
0 1.0
1 2.0
2 3.0
3 4.0
4 NaN
dtype: float64
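To make the contrast with the question explicit, here is a minimal sketch (the variable names cast_failed and converted are only illustrative) showing that a plain astype cast raises on the '.' entry, while pd.to_numeric with errors='coerce' turns it into NaN:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, '.'])  # object dtype because of the '.' entry

# A plain cast raises as soon as it reaches the non-numeric value.
try:
    s.astype('float64')
    cast_failed = False
except ValueError:
    cast_failed = True

# errors='coerce' replaces anything that cannot be parsed with NaN.
converted = pd.to_numeric(s, errors='coerce')

print(cast_failed)
print(converted.dtype)
print(converted.isna().tolist())
```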
If you need the NaNs filled in, use Series.fillna.
pd.to_numeric(s, errors='coerce').fillna(0, downcast='infer')
0 1
1 2
2 3
3 4
4 0
dtype: int64
Note that downcast='infer' will attempt to downcast floats to integers where possible. Remove the argument if you don't want that.
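As far as I know, recent pandas releases deprecate the downcast argument of fillna, so it may warn or be unavailable depending on your version. A sketch of an alternative that does not rely on it, assuming every remaining value is a whole number, is to fill first and then cast explicitly:

```python
import pandas as pd

s = pd.Series(['1', '2', '3', '4', '.'])

# After fillna there are no missing values left, but the dtype is still float64.
filled = pd.to_numeric(s, errors='coerce').fillna(0)

# Explicit cast to a plain integer dtype; this truncates any fractional part,
# so only use it when the values are known to be whole numbers.
as_int = filled.astype('int64')

print(filled.dtype)
print(as_int.tolist())
```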
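If the column holds whole numbers, pandas 0.24 and later also provide nullable integer dtypes, which let integers coexist with missing values:
pd.__version__
# '0.24.1'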
pd.to_numeric(s, errors='coerce').astype('Int32')
0 1
1 2
2 3
3 4
4 NaN
dtype: Int32
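The same idea as a self-contained sketch, here using Int64 rather than Int32 (either should work for values this small). Note that newer pandas versions display the missing entry as <NA> rather than NaN:

```python
import pandas as pd

s = pd.Series(['1', '2', '3', '4', '.'])

# Coerce to float64 first, then cast to a nullable integer dtype.
# The capital "I" in 'Int64' selects the nullable extension dtype.
nullable = pd.to_numeric(s, errors='coerce').astype('Int64')

print(nullable.dtype)
print(nullable.isna().tolist())
print(nullable.sum())  # missing values are skipped by default
```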
There are other options to choose from as well; read the docs for more.
Extension for DataFrames
If you need to extend this to DataFrames, you will need to apply it to each column. You can do this using DataFrame.apply.
# Setup.
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame({
    'A': np.random.choice(10, 5),
    'C': np.random.choice(10, 5),
    'B': ['1', '###', '...', 50, '234'],
    'D': ['23', '1', '...', '268', '$$'],
})[list('ABCD')]
df
   A    B  C    D
0  5    1  9   23
1  0  ###  3    1
2  3  ...  5  ...
3  3   50  2  268
4  7  234  4   $$
df.dtypes
A int64
B object
C int64
D object
dtype: object
df2 = df.apply(pd.to_numeric, errors='coerce')
df2
   A      B  C      D
0  5    1.0  9   23.0
1  0    NaN  3    1.0
2  3    NaN  5    NaN
3  3   50.0  2  268.0
4  7  234.0  4    NaN
df2.dtypes
A int64
B float64
C int64
D float64
dtype: object
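Because errors='coerce' silently discards whatever it cannot parse, it can be worth checking what was lost. The following is only a sketch of one way to do that; the names coerced and bad_values are illustrative and not part of pandas:

```python
import pandas as pd

df = pd.DataFrame({
    'B': ['1', '###', '...', 50, '234'],
    'D': ['23', '1', '...', '268', '$$'],
})
converted = df.apply(pd.to_numeric, errors='coerce')

# True where a value was present before conversion but is NaN afterwards.
coerced = df.notna() & converted.isna()

# Collect the original values that could not be parsed, per column.
bad_values = {col: df.loc[coerced[col], col].tolist() for col in df.columns}

print(coerced.sum())
print(bad_values)
```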
You can also do this with DataFrame.transform, although my tests indicate this is marginally slower:
df.transform(pd.to_numeric, errors='coerce')
   A      B  C      D
0  5    1.0  9   23.0
1  0    NaN  3    1.0
2  3    NaN  5    NaN
3  3   50.0  2  268.0
4  7  234.0  4    NaN
If you have many columns, some numeric and some not, you can make this a little more performant by applying pd.to_numeric to the non-numeric columns only.
df.dtypes.eq(object)
A False
B True
C False
D True
dtype: bool
cols = df.columns[df.dtypes.eq(object)]
# Actually, `cols` can be any list of columns you need to convert.
cols
# Index(['B', 'D'], dtype='object')
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')
# Alternatively,
# for c in cols:
#     df[c] = pd.to_numeric(df[c], errors='coerce')
df
   A      B  C      D
0  5    1.0  9   23.0
1  0    NaN  3    1.0
2  3    NaN  5    NaN
3  3   50.0  2  268.0
4  7  234.0  4    NaN
Applying pd.to_numeric along the columns (i.e., axis=0, the default) should be slightly faster for long DataFrames.
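The column-selective approach can be wrapped in a small helper. This is a sketch under a few assumptions: coerce_object_columns is a hypothetical name, it picks columns with select_dtypes(exclude='number') instead of dtypes.eq(object), and it assumes every non-numeric column is text-like (datetime columns, for example, would also be converted).

```python
import pandas as pd

def coerce_object_columns(df):
    """Return a copy of df with its non-numeric columns passed through pd.to_numeric."""
    out = df.copy()
    # Assumption: anything that is not already numeric should be converted.
    cols = out.select_dtypes(exclude='number').columns
    for c in cols:
        out[c] = pd.to_numeric(out[c], errors='coerce')
    return out

df = pd.DataFrame({
    'A': [5, 0, 3, 3, 7],
    'B': ['1', '###', '...', 50, '234'],
    'D': ['23', '1', '...', '268', '$$'],
})
result = coerce_object_columns(df)
print(result.dtypes)
```

Working on a copy keeps the original DataFrame intact, which makes it easier to compare the values before and after conversion.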