How to one-hot encode variable-length features?
Question
Given a list of variable-length features:
features = [
    ['f1', 'f2', 'f3'],
    ['f2', 'f4', 'f5', 'f6'],
    ['f1', 'f2']
]
where each sample has a different number of features, and each feature is a str that is already one-hot (a feature is either present in a sample or absent).
In order to use sklearn's feature selection utilities, I have to convert features to a 2D array that looks like:
    f1  f2  f3  f4  f5  f6
s1   1   1   1   0   0   0
s2   0   1   0   1   1   1
s3   1   1   0   0   0   0
How can I achieve this with sklearn or numpy?
Recommended answer
You can use MultiLabelBinarizer from scikit-learn, which is designed for exactly this kind of conversion.
With your example:
from sklearn.preprocessing import MultiLabelBinarizer

features = [
    ['f1', 'f2', 'f3'],
    ['f2', 'f4', 'f5', 'f6'],
    ['f1', 'f2']
]

# Learn the set of distinct features, then encode each sample as a 0/1 row
mlb = MultiLabelBinarizer()
new_features = mlb.fit_transform(features)
Output:
array([[1, 1, 1, 0, 0, 0],
       [0, 1, 0, 1, 1, 1],
       [1, 1, 0, 0, 0, 0]])
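The output array carries no column names, so it may help to see which feature each column stands for. The short sketch below reads the fitted `classes_` attribute (the sorted distinct features) and uses `inverse_transform` to map the rows back to feature tuples. The variable names `column_names` and `recovered` are only illustrative.

```python
from sklearn.preprocessing import MultiLabelBinarizer

features = [
    ['f1', 'f2', 'f3'],
    ['f2', 'f4', 'f5', 'f6'],
    ['f1', 'f2'],
]

mlb = MultiLabelBinarizer()
new_features = mlb.fit_transform(features)

# classes_ holds the distinct features in sorted order, one per output column
column_names = list(mlb.classes_)

# inverse_transform turns each 0/1 row back into a tuple of feature names
recovered = mlb.inverse_transform(new_features)

print(column_names)
print(recovered)
```

Column i of `new_features` corresponds to `column_names[i]`, which matches the f1..f6 header of the table in the question.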
The result can also be used in a pipeline, along with other feature_selection utilities, though MultiLabelBinarizer itself typically needs a thin wrapper to act as a Pipeline step.
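A minimal sketch of such a pipeline follows. Pipeline passes both X and y to each step, whereas MultiLabelBinarizer's fit and transform accept a single argument, so the sketch wraps it in a small class. `MultiHotEncoder` is a hypothetical name introduced here for illustration and is not part of scikit-learn. VarianceThreshold is just one possible feature_selection utility, chosen because it needs no target labels.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_selection import VarianceThreshold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer


class MultiHotEncoder(BaseEstimator, TransformerMixin):
    """Hypothetical wrapper giving MultiLabelBinarizer an (X, y) interface."""

    def fit(self, X, y=None):
        # Fit the binarizer on the feature lists; y is ignored
        self.mlb_ = MultiLabelBinarizer().fit(X)
        return self

    def transform(self, X):
        return self.mlb_.transform(X)


features = [
    ['f1', 'f2', 'f3'],
    ['f2', 'f4', 'f5', 'f6'],
    ['f1', 'f2'],
]

pipe = Pipeline([
    ("encode", MultiHotEncoder()),
    # Drop columns whose value is identical in every sample
    ("select", VarianceThreshold(threshold=0.0)),
])

selected = pipe.fit_transform(features)
print(selected.shape)
```

In this toy data, f2 appears in every sample, so its column is constant and VarianceThreshold drops it, leaving five of the six columns.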