Applying OneHotEncoder on a numpy array


Question

I am applying OneHotEncoder to a numpy array.

Here is the code:

from sklearn.preprocessing import OneHotEncoder

print X.shape, test_data.shape  # gives (4100, 15) (410, 15)
# Note: categorical_features is only available in scikit-learn < 0.22
onehotencoder_1 = OneHotEncoder(categorical_features=[0, 3, 4, 5, 6, 8, 9, 11, 12])
X = onehotencoder_1.fit_transform(X).toarray()
onehotencoder_2 = OneHotEncoder(categorical_features=[0, 3, 4, 5, 6, 8, 9, 11, 12])
test_data = onehotencoder_2.fit_transform(test_data).toarray()

print X.shape, test_data.shape  # gives (4100, 46) (410, 43)

Both X and test_data are <type 'numpy.ndarray'>.

X is my training set and test_data is my test set.

Why is the number of columns different for X and test_data? After applying OneHotEncoder, both should have the same number of columns, either 46 or 43.

I am applying OneHotEncoder only to specific attributes because they are categorical in both X and test_data.

Can someone point out what is wrong here?

Recommended answer

Don't create a new OneHotEncoder for test_data. Reuse the first one, and call only transform() on it. Do this:

test_data = onehotencoder_1.transform(test_data).toarray()

Never call fit() (or fit_transform()) on test data.
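The categorical_features argument used in the question no longer exists in recent scikit-learn releases. As a rough sketch of how the same fit-once, transform-twice pattern might look there, the example below uses ColumnTransformer to select the categorical columns. The small arrays and the column indices [0, 2] are hypothetical stand-ins for the asker's X and test_data, not the real data.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-ins for X and test_data:
# columns 0 and 2 are categorical, column 1 is numeric.
X = np.array([[0, 1.5, 0],
              [1, 2.5, 1],
              [2, 3.5, 2],
              [1, 4.5, 0]], dtype=float)
test_data = np.array([[0, 9.0, 1],
                      [1, 8.0, 1]], dtype=float)

# One-hot encode only the listed columns and pass the rest through unchanged.
# sparse_threshold=0 makes the transformer always return a dense array.
ct = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), [0, 2])],
    remainder="passthrough",
    sparse_threshold=0,
)

X_enc = ct.fit_transform(X)         # fit on the training set only
test_enc = ct.transform(test_data)  # reuse the fitted encoder on the test set

print(X_enc.shape, test_enc.shape)
```

Because the encoder is fitted once on the training set, both outputs have the same number of columns. handle_unknown="ignore" is an optional choice here: it encodes a category seen only in the test set as all zeros instead of raising an error.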

A different number of columns is entirely possible, because the test data may not contain some categories that are present in the training data. When you create a new OneHotEncoder and call fit() (or fit_transform()) on it with test_data, it learns only the categories present in test_data, so it produces fewer columns than the encoder fitted on the training set.
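The effect can be reproduced on a toy example. The sketch below uses a made-up single-column dataset (not the asker's data) in which one category is missing from the test set, and compares a separately fitted encoder with the reused one.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data with a single categorical column.
train = np.array([["red"], ["green"], ["blue"], ["green"]])
test = np.array([["red"], ["green"]])  # "blue" never appears in the test set

# Wrong: a second encoder fitted on the test set learns only two categories.
wrong_cols = OneHotEncoder().fit_transform(test).toarray().shape[1]

# Right: fit once on the training set, then only transform the test set.
encoder = OneHotEncoder()
train_encoded = encoder.fit_transform(train).toarray()
test_encoded = encoder.transform(test).toarray()

print(train_encoded.shape, test_encoded.shape, wrong_cols)
```

The reused encoder yields three columns for both sets, while the separately fitted one yields only two for the test set, which mirrors the 46 versus 43 mismatch in the question.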
