每个示例对具有多个类别的分类特征进行编码

每个示例对具有多个类别的分类特征进行编码

本文介绍了每个示例对具有多个类别的分类特征进行编码-sklearn的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在处理包含流派作为特征的电影数据集。数据集中的示例可能同时属于多个流派。因此,它们包含一个类型标签列表。

I'm working on a movie dataset which contains genre as a feature. The examples in the dataset may belong to multiple genres at the same time. So, they contain a list of genre labels.

数据看起来像这样-

    movieId                                         genres
0        1  [Adventure, Animation, Children, Comedy, Fantasy]
1        2                     [Adventure, Children, Fantasy]
2        3                                  [Comedy, Romance]
3        4                           [Comedy, Drama, Romance]
4        5                                           [Comedy]

I想要向量化此功能。我尝试了 LabelEncoder OneHotEncoder ,但是它们似乎无法直接处理这些列表。

I want to vectorize this feature. I have tried LabelEncoder and OneHotEncoder, but they can't seem to handle these lists directly.

我可以手动矢量化,但是我有其他相似的功能,其中包含太多类别。对于那些我更喜欢直接使用 FeatureHasher 类的方法。

I could vectorize this manually, but I have other similar features that contain too many categories. For those I'd prefer some way to use the FeatureHasher class directly.

是否有某种方法可以使这些编码器类在此类上工作一项功能?还是有更好的方法来表示这样的功能,从而使编码更容易?

Is there some way to get these encoder classes to work on such a feature? Or is there a better way to represent such a feature that will make encoding easier? I'd gladly welcome any suggestions.

推荐答案

有一些令人印象深刻的答案。在您的示例数据上,Teoretic的最后答案(使用 sklearn.preprocessing.MultiLabelBinarizer )比Paulo Alves的解决方案快14倍(并且两者都比公认的答案快) !):

This SO question has some impressive answers. On your example data, the last answer by Teoretic (using sklearn.preprocessing.MultiLabelBinarizer) is 14 times faster than the solution by Paulo Alves (and both are faster than the accepted answer!):

from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
encoded = pd.DataFrame(mlb.fit_transform(df['genres']), columns=mlb.classes_, index=df.index)
result = pd.concat([df['movieId'], encoded], axis=1)

# Increase max columns to print the entire resulting DataFrame
pd.options.display.max_columns = 50
result
   movieId  Adventure  Animation  Children  Comedy  Drama  Fantasy  Romance
0        1          1          1         1       1      0        1        0
1        2          1          0         1       0      0        1        0
2        3          0          0         0       1      0        0        1
3        4          0          0         0       1      1        0        1
4        5          0          0         0       1      0        0        0

这篇关于每个示例对具有多个类别的分类特征进行编码-sklearn的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!

08-13 19:01