Problem description
I am using the PASCAL VOC 2012 dataset for image classification. Some images have multiple labels, whereas others have a single label, as shown below.
0 2007_000027.jpg {'person'}
1 2007_000032.jpg {'aeroplane', 'person'}
2 2007_000033.jpg {'aeroplane'}
3 2007_000039.jpg {'tvmonitor'}
4 2007_000042.jpg {'train'}
I want to one-hot encode these labels to train the model. However, I can't use keras.utils.to_categorical, because the labels are not integers, and pandas.get_dummies does not give the expected result: it treats each unique combination of labels as its own category, producing columns like these:
{'aeroplane', 'bus', 'car'} {'aeroplane', 'bus'} {'tvmonitor', 'sofa'} {'tvmonitor'} ...
What is the best way to one-hot encode these labels, given that the number of labels per image is not fixed?
Recommended answer
If the second column contains sets, use MultiLabelBinarizer:
print (df)
a b
0 2007_000027.jpg {'person'}
1 2007_000032.jpg {'aeroplane', 'person'}
2 2007_000033.jpg {'aeroplane'}
3 2007_000039.jpg {'tvmonitor'}
4 2007_000042.jpg {'train'}
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
# Each distinct label found in the sets of column 'b' becomes one 0/1 column
df = pd.DataFrame(mlb.fit_transform(df['b']),columns=mlb.classes_)
print (df)
aeroplane person train tvmonitor
0 0 1 0 0
1 1 1 0 0
2 1 0 0 0
3 0 0 0 1
4 0 0 1 0
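The snippet above replaces df with the encoded columns, so the file names are lost, and the columns only cover labels that happen to occur in the data. The following is a minimal sketch of one way around both points, assuming the 20 standard PASCAL VOC object classes and the five sample rows from the question. The names VOC_CLASSES, y, encoded, result and decoded are illustrative and not part of the original answer.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Assumption: the 20 PASCAL VOC object classes, in alphabetical order.
VOC_CLASSES = [
    "aeroplane", "bicycle", "bird", "boat", "bottle",
    "bus", "car", "cat", "chair", "cow",
    "diningtable", "dog", "horse", "motorbike", "person",
    "pottedplant", "sheep", "sofa", "train", "tvmonitor",
]

# The sample rows from the question.
df = pd.DataFrame({
    "a": ["2007_000027.jpg", "2007_000032.jpg", "2007_000033.jpg",
          "2007_000039.jpg", "2007_000042.jpg"],
    "b": [{"person"}, {"aeroplane", "person"}, {"aeroplane"},
          {"tvmonitor"}, {"train"}],
})

# Passing classes= fixes the column order and keeps a column for every
# class, including classes that do not occur in this subset.
mlb = MultiLabelBinarizer(classes=VOC_CLASSES)
y = mlb.fit_transform(df["b"])  # integer array of shape (5, 20)

# Keep the file names next to the encoded labels instead of overwriting df.
encoded = pd.DataFrame(y, columns=mlb.classes_, index=df.index)
result = pd.concat([df[["a"]], encoded], axis=1)

# Map rows of 0/1 values back to label tuples, e.g. to read predictions.
decoded = mlb.inverse_transform(y)
print(result[["a", "aeroplane", "person", "train", "tvmonitor"]])
print(decoded)
```

The array y has one row per image and one column per class, so it can be used directly as the target array when training a multi-label model.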
Or, starting again from the original DataFrame, use Series.str.join with Series.str.get_dummies, but this should be slower on a large DataFrame:
df = df['b'].str.join('|').str.get_dummies()
print (df)
aeroplane person train tvmonitor
0 0 1 0 0
1 1 1 0 0
2 1 0 0 0
3 0 0 0 1
4 0 0 1 0
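Both approaches produce the same table. To make explicit what that table contains, here is a dependency-free sketch of the encoding, assuming the five sample label sets from the question. The helper multi_hot is hypothetical and only illustrates the idea; it is not how either library implements it.

```python
# Label sets for the five sample images from the question.
rows = [
    {"person"},
    {"aeroplane", "person"},
    {"aeroplane"},
    {"tvmonitor"},
    {"train"},
]

# One column per distinct label, sorted alphabetically.
classes = sorted(set().union(*rows))
column_of = {name: i for i, name in enumerate(classes)}


def multi_hot(labels):
    """Return a 0/1 list with a 1 in the column of every label present."""
    vector = [0] * len(classes)
    for label in labels:
        vector[column_of[label]] = 1
    return vector


encoded = [multi_hot(labels) for labels in rows]
print(classes)
for vector in encoded:
    print(vector)
```

An image with several labels gets several 1s in its row, which is why this differs from the one-column-per-combination result of pandas.get_dummies described in the question.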