Question
I'm playing with a one-vs-all Logistic Regression classifier using Scikit-Learn (sklearn). I have a large dataset, and training on all of it in one go is too slow; I would also like to study the learning curve as training proceeds.
I would like to use batch gradient descent to train my classifier in batches of, say, 500 samples. Is there some way of using sklearn to do this, or should I abandon sklearn and "roll my own"?
This is what I have so far:
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
# xs are subsets of my training data, ys are ground truth for same; I have more
# data available for further training and cross-validation:
xs.shape, ys.shape
# => ((500, 784), (500,))
lr = OneVsRestClassifier(LogisticRegression())
lr.fit(xs, ys)
# recent scikit-learn versions require a 2-D array, so pass a one-row slice
lr.predict(xs[0:1, :])
# => [ 1.]
ys[0]
# => 1.0
I.e. it correctly identifies a training sample (yes, I realize it would be better to evaluate it with new data -- this is just a quick smoke-test).
Re batch gradient descent: I haven't gotten as far as creating learning curves, but can one simply run fit repeatedly on subsequent subsets of the training data? Or is there some other function for training in batches? The documentation and Google are fairly silent on the matter. Thanks!
Recommended answer
What you want is not batch gradient descent but stochastic gradient descent. Batch learning means learning on the entire training set in one go, while what you describe is properly called minibatch learning. That's implemented in sklearn.linear_model.SGDClassifier, which fits a logistic regression model if you give it the option loss="log" (spelled loss="log_loss" in recent scikit-learn releases).
With SGDClassifier, as with LogisticRegression, there's no need to wrap the estimator in a OneVsRestClassifier: both handle multiclass problems out of the box (SGDClassifier does so with a one-vs-all scheme).
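from sklearn.linear_model import SGDClassifier
# You'll have to set a few other options to get good estimates, in particular
# the number of iterations (max_iter; n_iter in old releases), but this should
# get you going. Older scikit-learn releases spell this loss "log".
lr = SGDClassifier(loss="log_loss")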
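To check the point about not needing a OneVsRestClassifier wrapper, here is a minimal sketch. It assumes synthetic three-class data from make_classification as a stand-in for the real dataset, and it leaves the loss at its default, since the multiclass handling does not depend on which loss is chosen. The fitted model ends up with one weight vector per class.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic stand-in data (an assumption for illustration): 3 classes, 20 features.
X, y = make_classification(n_samples=600, n_features=20, n_informative=10,
                           n_classes=3, random_state=0)

# No OneVsRestClassifier wrapper: SGDClassifier trains one binary
# classifier per class internally.
clf = SGDClassifier(random_state=0)
clf.fit(X, y)

print(clf.classes_)          # the distinct labels found in y
print(clf.coef_.shape)       # one row of weights per class
print(clf.intercept_.shape)  # one intercept per class
```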
Then, to train on minibatches, use the partial_fit method instead of fit. On the first call you have to give it the full list of classes, because not every class may be present in each minibatch:
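import numpy as np
# every label that can occur, known up front (np.unique sorts and de-duplicates)
classes = np.unique(["ham", "spam", "eggs"])
# minibatches: any iterable that yields (xs, ys) pairs
for xs, ys in minibatches:
    lr.partial_fit(xs, ys, classes=classes)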
(Here, I'm passing classes for each minibatch, which isn't necessary but doesn't hurt either and makes the code shorter.)
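Putting the pieces together, the following is a rough sketch of minibatch training that also records a learning curve, as the question asks. Several things here are assumptions for illustration only: the data is synthetic (make_classification), iter_minibatches is a hypothetical helper rather than a scikit-learn function, and the LOG_LOSS constant exists only to pick whichever spelling of the logistic loss the installed scikit-learn version accepts.

```python
import numpy as np
import sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# The logistic loss is spelled "log_loss" in recent releases, "log" in older ones.
_version = tuple(int(p) for p in sklearn.__version__.split(".")[:2])
LOG_LOSS = "log_loss" if _version >= (1, 1) else "log"


def iter_minibatches(X, y, batch_size):
    """Hypothetical helper: yield consecutive (X_batch, y_batch) slices."""
    for start in range(0, X.shape[0], batch_size):
        yield X[start:start + batch_size], y[start:start + batch_size]


# Synthetic stand-in for the real data: 3 classes, 20 features.
X, y = make_classification(n_samples=6000, n_features=20, n_informative=10,
                           n_classes=3, random_state=0)
X_train, y_train = X[:5000], y[:5000]
X_val, y_val = X[5000:], y[5000:]  # held out for the learning curve

clf = SGDClassifier(loss=LOG_LOSS, random_state=0)
classes = np.unique(y_train)

learning_curve = []  # (training samples seen so far, validation accuracy)
seen = 0
for xs, ys in iter_minibatches(X_train, y_train, batch_size=500):
    clf.partial_fit(xs, ys, classes=classes)  # one SGD pass over this batch
    seen += xs.shape[0]
    learning_curve.append((seen, clf.score(X_val, y_val)))

for n, acc in learning_curve:
    print(f"{n:5d} samples: validation accuracy {acc:.3f}")
```

Each partial_fit call makes a single pass over the batch it is given, so looping over the training set several times (ideally reshuffling between passes) is the usual way to get more epochs.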