Problem description
I am applying OneHotEncoder to a NumPy array.
Here is the code:
print X.shape, test_data.shape #gives (4100, 15) (410, 15)
onehotencoder_1 = OneHotEncoder(categorical_features = [0, 3, 4, 5, 6, 8, 9, 11, 12])
X = onehotencoder_1.fit_transform(X).toarray()
onehotencoder_2 = OneHotEncoder(categorical_features = [0, 3, 4, 5, 6, 8, 9, 11, 12])
test_data = onehotencoder_2.fit_transform(test_data).toarray()
print X.shape, test_data.shape #gives (4100, 46) (410, 43)
Here X and test_data are both <type 'numpy.ndarray'>. X is my training set and test_data is my test set.
Why is the number of columns different for X and test_data? After applying OneHotEncoder, both should have the same number of columns, either 46 or 43.
I am applying OneHotEncoder only to specific attributes because they are categorical in both X and test_data.
Can someone point out what is wrong here?
Recommended answer
Don't create a new OneHotEncoder for test_data. Reuse the first one, and call only transform() on it. Do this:
test_data = onehotencoder_1.transform(test_data).toarray()
Never call fit() (or fit_transform()) on test data.
A different number of columns is entirely possible, because the test data may not contain some categories that are present in the training data. When you create a new OneHotEncoder and call fit() (or fit_transform()) on test_data, it learns only the categories present in test_data, and it produces one output column per learned category. That is why the two arrays end up with different column counts (46 versus 43 here).
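To make the effect concrete, here is a minimal sketch on made-up toy data (the arrays and variable names are assumptions for illustration, not the asker's data). It encodes every column, because recent scikit-learn releases no longer accept the categorical_features argument used in the question; selecting specific columns there is usually done with ColumnTransformer instead.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: a single categorical column.
# The training set contains three categories, the test set only two.
X_train = np.array([["red"], ["green"], ["blue"], ["green"]])
X_test = np.array([["red"], ["green"]])

# Fit on the training data only, then reuse the same encoder on the test data.
# handle_unknown="ignore" encodes categories never seen in training as all zeros
# instead of raising an error.
encoder = OneHotEncoder(handle_unknown="ignore")
train_encoded = encoder.fit_transform(X_train).toarray()
test_encoded = encoder.transform(X_test).toarray()

# Fitting a second encoder on the test data reproduces the column mismatch.
refit_encoded = OneHotEncoder().fit_transform(X_test).toarray()

print(train_encoded.shape, test_encoded.shape, refit_encoded.shape)
```

The reused encoder gives the test array the same three columns as the training array, while the refitted encoder yields only two because it never saw the third category.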