Changing the data type of a column in Pandas

Problem description

I want to convert a table, represented as a list of lists, into a Pandas DataFrame. As an extremely simplified example:

import pandas as pd
a = [['a', '1.2', '4.2'], ['b', '70', '0.03'], ['x', '5', '0']]
df = pd.DataFrame(a)

What is the best way to convert the columns to the appropriate types, in this case columns 2 and 3 into floats? Is there a way to specify the types while converting to DataFrame? Or is it better to create the DataFrame first and then loop through the columns to change the type for each column? Ideally I would like to do this in a dynamic way because there can be hundreds of columns and I don't want to specify exactly which columns are of which type. All I can guarantee is that each column contains values of the same type.

Recommended answer

You have three main options for converting types in pandas:

  1. to_numeric() - provides functionality to safely convert non-numeric types (e.g. strings) to a suitable numeric type. (See also to_datetime() and to_timedelta().)

  2. astype() - converts (almost) any type to (almost) any other type (even if it's not necessarily sensible to do so). Also allows you to convert to categorical types (very useful).

  3. infer_objects() - a utility method to convert object columns holding Python objects to a pandas type if possible.

Read on for more detailed explanations and usage of each of these methods.

1. to_numeric()

The best way to convert one or more columns of a DataFrame to numeric values is to use pandas.to_numeric().

This function will try to change non-numeric objects (such as strings) into integers or floating point numbers as appropriate.

The input to to_numeric() is a Series or a single column of a DataFrame.

>>> s = pd.Series(["8", 6, "7.5", 3, "0.9"]) # mixed string and numeric values
>>> s
0      8
1      6
2    7.5
3      3
4    0.9
dtype: object

>>> pd.to_numeric(s) # convert everything to float values
0    8.0
1    6.0
2    7.5
3    3.0
4    0.9
dtype: float64

As you can see, a new Series is returned. Remember to assign this output to a variable or column name to continue using it:

# convert Series
my_series = pd.to_numeric(my_series)

# convert column "a" of a DataFrame
df["a"] = pd.to_numeric(df["a"])

You can also use it to convert multiple columns of a DataFrame via the apply() method:

# convert all columns of DataFrame
df = df.apply(pd.to_numeric)

# convert just columns "a" and "b"
df[["a", "b"]] = df[["a", "b"]].apply(pd.to_numeric)

As long as your values can all be converted, that's probably all you need.
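
Applied to the table from the question, this could look like the minimal sketch below. It assumes that the first column holds text labels and that every other column contains numeric strings; the name numeric_cols is introduced here purely for illustration.

```python
import pandas as pd

a = [['a', '1.2', '4.2'], ['b', '70', '0.03'], ['x', '5', '0']]
df = pd.DataFrame(a)

# With no column names given, the columns are labelled 0, 1, 2.
# Column 0 holds text labels, so only the remaining columns are converted.
numeric_cols = [1, 2]
df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric)

print(df.dtypes)
```

Selecting the columns by position like this scales to hundreds of columns, as long as you know which ones are supposed to be numeric.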

But what if some values can't be converted to a numeric type?

to_numeric() also takes an errors keyword argument that allows you to force non-numeric values to be NaN, or simply ignore columns containing these values.

Here's an example using a Series of strings s which has the object dtype:

>>> s = pd.Series(['1', '2', '4.7', 'pandas', '10'])
>>> s
0         1
1         2
2       4.7
3    pandas
4        10
dtype: object

The default behaviour is to raise if it can't convert a value. In this case, it can't cope with the string 'pandas':

>>> pd.to_numeric(s) # or pd.to_numeric(s, errors='raise')
ValueError: Unable to parse string

Rather than fail, we might want 'pandas' to be considered a missing/bad numeric value. We can coerce invalid values to NaN as follows using the errors keyword argument:

>>> pd.to_numeric(s, errors='coerce')
0     1.0
1     2.0
2     4.7
3     NaN
4    10.0
dtype: float64
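
Coercion silently replaces bad values, so it can be worth checking which entries were lost. The following is one possible sketch of such a check; the variable names converted and bad_values are made up for this example.

```python
import pandas as pd

s = pd.Series(['1', '2', '4.7', 'pandas', '10'])
converted = pd.to_numeric(s, errors='coerce')

# Entries that are NaN after conversion but were not missing before
# are the ones that could not be parsed as numbers.
bad_values = s[converted.isna() & s.notna()]

print(bad_values)
```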

The third option for errors is just to ignore the operation if an invalid value is encountered:

>>> pd.to_numeric(s, errors='ignore')
# the original Series is returned untouched

This last option is particularly useful when you want to convert your entire DataFrame, but don't know which of your columns can be converted reliably to a numeric type. In that case, just write:

df.apply(pd.to_numeric, errors='ignore')

The function will be applied to each column of the DataFrame. Columns that can be converted to a numeric type will be converted, while columns that cannot (e.g. they contain non-digit strings or dates) will be left alone.
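
Newer pandas releases (2.2 and later, to the best of my knowledge) emit a deprecation warning for errors='ignore' in to_numeric(). The same "convert what you can" behaviour can be approximated by catching the exception yourself. The sketch below is one way to do that, not an official replacement; the helper name to_numeric_if_possible and the sample column names are hypothetical.

```python
import pandas as pd

def to_numeric_if_possible(column):
    """Return the column as a numeric dtype, or unchanged if conversion fails."""
    try:
        return pd.to_numeric(column)
    except (ValueError, TypeError):
        # At least one value could not be parsed: keep the column as it is.
        return column

df = pd.DataFrame({
    "label": ["a", "b", "x"],
    "u": ["1.2", "70", "5"],
    "v": ["4", "0", "3"],
})

# apply() calls the helper once per column.
df = df.apply(to_numeric_if_possible)

print(df.dtypes)
```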

By default, conversion with to_numeric() will give you either an int64 or float64 dtype (or whatever integer width is native to your platform).

That's usually what you want, but what if you wanted to save some memory and use a more compact dtype, like float32, or int8?

to_numeric() gives you the option to downcast to 'integer', 'signed', 'unsigned', or 'float'. Here's an example for a simple Series s of integer type:

>>> s = pd.Series([1, 2, -7])
>>> s
0    1
1    2
2   -7
dtype: int64

Downcasting to 'integer' uses the smallest possible integer that can hold the values:

>>> pd.to_numeric(s, downcast='integer')
0    1
1    2
2   -7
dtype: int8

Downcasting to 'float' similarly picks a smaller than normal floating type:

>>> pd.to_numeric(s, downcast='float')
0    1.0
1    2.0
2   -7.0
dtype: float32
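
To see what downcasting saves in practice, you can compare memory usage before and after. This is a small sketch using made-up data (the integers 0 to 999); the actual saving depends on the range of your values.

```python
import numpy as np
import pandas as pd

# 1000 integers stored as int64 (8 bytes each)
s = pd.Series(np.arange(1000, dtype="int64"))

# Values up to 999 do not fit in int8, so the next size up is chosen
small = pd.to_numeric(s, downcast="integer")

print(s.dtype, s.memory_usage(index=False))
print(small.dtype, small.memory_usage(index=False))
```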


2. astype()

The astype() method enables you to be explicit about the dtype you want your DataFrame or Series to have. It's very versatile in that you can try to go from one type to any other.

Just pick a type: you can use a NumPy dtype (e.g. np.int16), some Python types (e.g. bool), or pandas-specific types (like the categorical dtype).

Call the method on the object you want to convert and astype() will try and convert it for you:

# convert all DataFrame columns to the int64 dtype
df = df.astype(int)

# convert column "a" to int64 dtype and "b" to complex type
df = df.astype({"a": int, "b": complex})

# convert Series to float16 type
s = s.astype(np.float16)

# convert Series to Python strings
s = s.astype(str)

# convert Series to categorical type - see docs for more details
s = s.astype('category')

Notice I said "try" - if astype() does not know how to convert a value in the Series or DataFrame, it will raise an error. For example if you have a NaN or inf value you'll get an error trying to convert it to an integer.
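
The NaN case can be reproduced with the short sketch below. The fillna(0) workaround at the end is only one possibility and assumes that 0 is an acceptable stand-in for missing values in your data.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan])

# A direct cast fails because NaN has no integer representation.
try:
    s.astype(int)
    cast_failed = False
except ValueError as exc:
    cast_failed = True
    print(f"astype(int) failed: {exc}")

# One possible workaround (an assumption about what suits the data):
# replace the missing values first, then cast.
filled = s.fillna(0).astype(int)
print(filled.tolist())
```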

As of pandas 0.20.0, this error can be suppressed by passing errors='ignore'. Your original object will be returned untouched.

astype() is powerful, but it will sometimes convert values "incorrectly". For example:

>>> s = pd.Series([1, 2, -7])
>>> s
0    1
1    2
2   -7
dtype: int64

These are small integers, so how about converting to an unsigned 8-bit type to save memory?

>>> s.astype(np.uint8)
0      1
1      2
2    249
dtype: uint8

The conversion worked, but the -7 was wrapped round to become 249 (i.e. 2^8 - 7)!

Trying to downcast using pd.to_numeric(s, downcast='unsigned') instead could help prevent this error.
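
A brief sketch of the difference: to_numeric() only downcasts when every value fits the target type, so a Series containing a negative number is left as it is rather than wrapped around. The names with_negative and non_negative are illustrative.

```python
import pandas as pd

with_negative = pd.Series([1, 2, -7])
non_negative = pd.Series([1, 2, 7])

# -7 cannot be stored in an unsigned type, so no downcast happens
# and the values are preserved.
kept = pd.to_numeric(with_negative, downcast='unsigned')

# All values are >= 0, so the smallest unsigned type is used.
shrunk = pd.to_numeric(non_negative, downcast='unsigned')

print(kept.dtype, kept.tolist())
print(shrunk.dtype, shrunk.tolist())
```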

3. infer_objects()

Version 0.21.0 of pandas introduced the method infer_objects() for converting columns of a DataFrame that have an object datatype to a more specific type (soft conversions).

For example, here's a DataFrame with two columns of object type. One holds actual integers and the other holds strings representing integers:

>>> df = pd.DataFrame({'a': [7, 1, 5], 'b': ['3','2','1']}, dtype='object')
>>> df.dtypes
a    object
b    object
dtype: object

Using infer_objects(), you can change the type of column 'a' to int64:

>>> df = df.infer_objects()
>>> df.dtypes
a     int64
b    object
dtype: object

Column 'b' has been left alone since its values were strings, not integers. If you wanted to try and force the conversion of both columns to an integer type, you could use df.astype(int) instead.
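
For comparison, here is the forced conversion on the same DataFrame, as a minimal sketch. It works only because every string in column 'b' represents a valid integer; a value such as 'x' would raise an error instead.

```python
import pandas as pd

df = pd.DataFrame({'a': [7, 1, 5], 'b': ['3', '2', '1']}, dtype='object')

# Unlike infer_objects(), astype(int) also parses the strings in column 'b'.
df = df.astype(int)

print(df.dtypes)
```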
