Using scikit-learn's LabelEncoder correctly across multiple programs


Problem description

The basic task I have at hand is:

a) Read some tab-separated data.

b) Do some basic preprocessing.

c) For each categorical column, use LabelEncoder to create a mapping. This is done roughly like this:

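from sklearn import preprocessing

# One LabelEncoder per categorical column
mapper = {}
for x in categorical_list:
    mapper[x] = preprocessing.LabelEncoder()

# Fit each encoder and replace the column with its integer codes
for x in categorical_list:
    df[x] = mapper[x].fit_transform(df[x])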

where df is a pandas DataFrame and categorical_list is a list of column headers that need to be transformed.

d) Train a classifier and save it to disk using pickle.

e) Now, in a different program, the saved model is loaded.

f) The test data is loaded and the same preprocessing is performed.

g) The LabelEncoders are used to convert the categorical data.

h) The model is used to predict.

Now the question I have is: will step g) work correctly?

As the LabelEncoder documentation states:

It can also be used to transform non-numerical labels (as long as 
they are hashable and comparable) to numerical labels.

So will each entry hash to exactly the same value every time?

If not, what is a good way to go about this? Is there any way to retrieve the encoder's mappings? Or is there an altogether different approach from LabelEncoder?

Recommended answer

According to the LabelEncoder implementation, the pipeline you've described will work correctly if and only if the LabelEncoders you fit at test time see data with exactly the same set of unique values as at training time. The encoder does not hash anything: it stores the sorted unique values in classes_ and encodes each label as its index in that array, so a different set of values yields a different mapping.
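To make the dependence on the set of unique values concrete, here is a minimal sketch with made-up data (the values and variable names are hypothetical, not from the question). It assumes only that classes_ holds the sorted unique labels and that a label's code is its position in that array:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical values for one categorical column.
train_values = ["red", "green", "blue"]
test_values = ["red", "green"]  # "blue" never appears at test time

train_encoder = LabelEncoder().fit(train_values)
refit_encoder = LabelEncoder().fit(test_values)  # refitting on the test data

# The same labels get different codes because the unique sets differ.
train_codes = train_encoder.transform(["green", "red"]).tolist()
refit_codes = refit_encoder.transform(["green", "red"]).tolist()

print(train_encoder.classes_.tolist(), train_codes)
print(refit_encoder.classes_.tolist(), refit_codes)
```

With these values the training encoder and the refitted encoder disagree on the codes for "green" and "red", which is exactly the mismatch that would silently corrupt the classifier's input.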

There's a somewhat hacky way to reuse the LabelEncoders you fitted during training. LabelEncoder has only one fitted attribute, namely classes_. You can save it to disk and then restore it like this:

Train:

import numpy
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoder.fit(X)
numpy.save('classes.npy', encoder.classes_)

Test:

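import numpy
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
# allow_pickle=True is needed when classes_ has object dtype (e.g. strings from pandas)
encoder.classes_ = numpy.load('classes.npy', allow_pickle=True)
# Now you should be able to use encoder
# as you would after `fit`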

This seems more efficient than refitting on the same data.
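Since the question already pickles the classifier, another option is to pickle the fitted encoders themselves. The sketch below is only an illustration under assumptions: the DataFrames, column names, and the file name encoders.pkl are hypothetical, and it presumes both programs run the same scikit-learn version, since pickles are not guaranteed to load across versions. It also shows one way to read the label-to-code mapping back out of classes_:

```python
import pickle
import tempfile
from pathlib import Path

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical data and column names, for illustration only.
train_df = pd.DataFrame({"color": ["red", "green", "blue"], "size": ["S", "M", "S"]})
test_df = pd.DataFrame({"color": ["blue", "red"], "size": ["M", "S"]})
categorical_list = ["color", "size"]

# Training program: fit one encoder per column and pickle the whole dict.
mapper = {col: LabelEncoder().fit(train_df[col]) for col in categorical_list}
encoder_path = Path(tempfile.mkdtemp()) / "encoders.pkl"
with open(encoder_path, "wb") as f:
    pickle.dump(mapper, f)

# Test program: load the fitted encoders and call transform only, never fit.
with open(encoder_path, "rb") as f:
    loaded = pickle.load(f)
for col in categorical_list:
    test_df[col] = loaded[col].transform(test_df[col])

# The label-to-code mapping can be read back from classes_.
color_mapping = {label: code for code, label in enumerate(loaded["color"].classes_)}
print(test_df)
print(color_mapping)
```

Calling transform rather than fit_transform in the test program is the important part: the codes then come from the training-time classes_, regardless of which values happen to appear in the test data.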
