Problem description
I'm working on a movie dataset which contains genre as a feature. The examples in the dataset may belong to multiple genres at the same time. So, they contain a list of genre labels.
The data looks like this:
   movieId                                             genres
0        1  [Adventure, Animation, Children, Comedy, Fantasy]
1        2                     [Adventure, Children, Fantasy]
2        3                                  [Comedy, Romance]
3        4                           [Comedy, Drama, Romance]
4        5                                           [Comedy]
I want to vectorize this feature. I have tried LabelEncoder and OneHotEncoder, but they can't seem to handle these lists directly.
I could vectorize this manually, but I have other similar features that contain too many categories. For those I'd prefer some way to use the FeatureHasher class directly.
Is there some way to get these encoder classes to work on such a feature? Or is there a better way to represent such a feature that will make encoding easier? I'd gladly welcome any suggestions.
Recommended answer
This SO question has some impressive answers. On your example data, the last answer by Teoretic (using sklearn.preprocessing.MultiLabelBinarizer) is 14 times faster than the solution by Paulo Alves (and both are faster than the accepted answer!):
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# df is the DataFrame from the question, with a 'genres' column of lists
mlb = MultiLabelBinarizer()
# One 0/1 indicator column per distinct genre, aligned to df's index
encoded = pd.DataFrame(mlb.fit_transform(df['genres']), columns=mlb.classes_, index=df.index)
result = pd.concat([df['movieId'], encoded], axis=1)
# Increase max columns to print the entire resulting DataFrame
pd.options.display.max_columns = 50
result
   movieId  Adventure  Animation  Children  Comedy  Drama  Fantasy  Romance
0        1          1          1         1       1      0        1        0
1        2          1          0         1       0      0        1        0
2        3          0          0         0       1      0        0        1
3        4          0          0         0       1      1        0        1
4        5          0          0         0       1      0        0        0
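For the other features mentioned in the question, the ones with too many categories, FeatureHasher can consume the label lists directly when it is created with input_type="string". The following is only a rough sketch, not part of the original answer: the DataFrame is rebuilt from the sample data in the question, and n_features=16 is an arbitrary value chosen for illustration (a real high-cardinality feature would likely need a much larger number to keep hash collisions rare).

```python
import pandas as pd
from sklearn.feature_extraction import FeatureHasher

# Sample data copied from the question.
df = pd.DataFrame({
    "movieId": [1, 2, 3, 4, 5],
    "genres": [
        ["Adventure", "Animation", "Children", "Comedy", "Fantasy"],
        ["Adventure", "Children", "Fantasy"],
        ["Comedy", "Romance"],
        ["Comedy", "Drama", "Romance"],
        ["Comedy"],
    ],
})

# Assumption: 16 output columns, picked only to keep the demo small.
# input_type="string" makes each sample an iterable of string tokens.
hasher = FeatureHasher(n_features=16, input_type="string")

# transform() needs no fitting; it returns a SciPy sparse matrix
# with one row per movie and n_features columns.
hashed = hasher.transform(df["genres"].tolist())

print(hashed.shape)
print(hashed.toarray())
```

Unlike MultiLabelBinarizer, the hashed columns have no names and cannot be mapped back to genres, and distinct labels may collide in the same column, so this trade-off mainly pays off when the number of categories is very large.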