Problem description
The basic task I have at hand is:
a) Read some tab-separated data.
b) Do some basic preprocessing.
c) For each categorical column, use LabelEncoder to create a mapping. This is done somewhat like this:
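from sklearn import preprocessing

mapper = {}
# Create one LabelEncoder per categorical column
for x in categorical_list:
    mapper[x] = preprocessing.LabelEncoder()
# Fit each encoder on its column and replace the column with integer codes
for x in categorical_list:
    df[x] = mapper[x].fit_transform(df[x])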
where df is a pandas DataFrame and categorical_list is a list of column headers that need to be transformed.
d) Train a classifier and save it to disk using pickle.
e) Now, in a different program, the saved model is loaded.
f) The test data is loaded and the same preprocessing is performed.
g) The LabelEncoders are used to convert the categorical data.
h) The model is used to predict.
Now the question I have is: will step g) work correctly?
As the documentation for LabelEncoder says:
It can also be used to transform non-numerical labels (as long as
they are hashable and comparable) to numerical labels.
So will each entry hash to the exact same value every time?
If not, what is a good way to go about this? Is there any way to retrieve the mappings of the encoder? Or is there an altogether different approach from LabelEncoder?
Recommended answer
According to the LabelEncoder implementation, the pipeline you've described will work correctly if and only if you fit the LabelEncoders at test time with data that has exactly the same set of unique values. The encoder does not hash anything: it sorts the unique values it sees during fit and uses each value's position in that sorted order as its code, so a different set of unique values produces different codes.
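To make the point above concrete, here is a minimal sketch, assuming a recent scikit-learn, that fits two encoders on different label sets and reads the label-to-code mapping back from classes_. The colour labels are made up for illustration and are not from the question.

```python
from sklearn.preprocessing import LabelEncoder

# Fit on "training" labels: classes_ holds the sorted unique values.
train_encoder = LabelEncoder()
train_encoder.fit(["red", "green", "blue", "green"])

# The label -> integer mapping is each label's index in classes_.
train_mapping = {str(label): code for code, label in enumerate(train_encoder.classes_)}

# Refitting on data with a different set of unique values shifts the codes.
test_encoder = LabelEncoder()
test_encoder.fit(["red", "green"])
test_mapping = {str(label): code for code, label in enumerate(test_encoder.classes_)}

print(train_mapping)  # {'blue': 0, 'green': 1, 'red': 2}
print(test_mapping)   # {'green': 0, 'red': 1}
```

Here "red" is encoded as 2 by the first encoder but as 1 by the second, which is why refitting at test time is only safe when the set of unique values is identical.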
There's a somewhat hacky way to reuse the LabelEncoders you got during training. LabelEncoder has only one fitted attribute, namely classes_. You can save it to disk and then restore it like this:
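Train:
import numpy
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoder.fit(X)
numpy.save('classes.npy', encoder.classes_)
Test:
import numpy
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
# allow_pickle=True is needed when classes_ is an object array (e.g. strings from a pandas column)
encoder.classes_ = numpy.load('classes.npy', allow_pickle=True)
# Now you should be able to use encoder
# as you would do after `fit`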
This seems more efficient than refitting it on the same data.
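An alternative worth considering, sketched here under assumptions rather than taken from the answer, is to pickle the whole dictionary of fitted encoders next to the model and call transform (not fit_transform) in the second program. The column name "color", the toy data, and the file name encoders.pkl are hypothetical, and the sketch assumes both programs run the same scikit-learn version.

```python
import os
import pickle
import tempfile

import pandas as pd
from sklearn.preprocessing import LabelEncoder

categorical_list = ["color"]  # hypothetical column name
encoders_path = os.path.join(tempfile.mkdtemp(), "encoders.pkl")  # hypothetical file

# --- training program: fit one encoder per column, then save them all ---
train_df = pd.DataFrame({"color": ["red", "green", "blue"]})
mapper = {}
for col in categorical_list:
    mapper[col] = LabelEncoder()
    train_df[col] = mapper[col].fit_transform(train_df[col])
with open(encoders_path, "wb") as f:
    pickle.dump(mapper, f)

# --- prediction program: load the fitted encoders and only transform ---
with open(encoders_path, "rb") as f:
    loaded_mapper = pickle.load(f)
test_df = pd.DataFrame({"color": ["blue", "red"]})
for col in categorical_list:
    # No refit here, so the codes match the ones seen during training.
    test_df[col] = loaded_mapper[col].transform(test_df[col])

print(test_df["color"].tolist())  # [0, 2]
```

Note that transform raises a ValueError if the test data contains a label the encoder never saw during fit, so unseen categories have to be handled before this step.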