This post covers how to encode a categorical variable such as "State Names".

Question

I have a categorical column, 'State Names'. I'm unsure which type of categorical encoding I should use to convert it to a numeric type.

There are 83 unique State names.

Label Encoder is meant for ordinal categorical variables, while OneHot would add a lot of columns, since there are 83 unique State names.

Is there anything else I can try?

Recommended answer

I would use scikit-learn's OneHotEncoder (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) or CategoricalEncoder with encoding set to 'onehot' (note that CategoricalEncoder appeared only in a development version of scikit-learn; in released versions its functionality lives in OneHotEncoder and OrdinalEncoder). The encoder automatically finds the unique values of each feature and turns each value into a one-hot vector. This does increase the input dimensionality for that feature (83 columns here), but that is generally the price of representing a nominal feature faithfully. If you instead convert the feature to a single ordinal integer rather than a vector of binary values, an algorithm may draw incorrect conclusions about two (possibly completely unrelated) categorical values that just happen to receive nearby integer codes.
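As a rough illustration of the approach described above, here is a minimal sketch using scikit-learn's OneHotEncoder. The four sample state names and the variable names are made up for the example and are not from the question; the choice of handle_unknown="ignore" is also an assumption, picked so that a state not seen during fitting does not raise an error.

```python
from sklearn.preprocessing import OneHotEncoder

# Hypothetical sample data: a single categorical column of state names.
states = [["Texas"], ["Ohio"], ["Texas"], ["Utah"]]

# Assumption: with handle_unknown="ignore", a category not seen during fit
# is encoded as an all-zero row instead of raising an error.
encoder = OneHotEncoder(handle_unknown="ignore")

# By default the result is a SciPy sparse matrix, so a wide encoding
# (e.g. 83 columns) stores only the non-zero entries.
encoded = encoder.fit_transform(states)

# Unique values learned for the first (and only) feature, in sorted order.
categories = encoder.categories_[0]

# Dense copy, only for inspecting this small example.
dense = encoded.toarray()

# A state that was not present during fit.
unseen = encoder.transform([["Guam"]]).toarray()

print(list(categories))
print(dense)
print(unseen)
```

Each row contains exactly one 1, so every pair of distinct states is equally far apart. With integer codes, by contrast, some pairs would look closer together than others.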


10-24 15:09