Question
I am using the scikit-learn library for SVM. I have a huge amount of data that I cannot read into memory all at once to pass to the fit() function.
I want to iterate over all my data, which is in a file, and train the SVM step by step. Is there any way to do this? It is not clear from the documentation, and in their tutorial they pass the complete data to fit at once.
Is there any way to train it one sample at a time (meaning something like calling fit for every input pattern of the training data)?
Recommended answer
A Support Vector Machine (at least as implemented in libsvm, which scikit-learn wraps) is fundamentally a batch algorithm: it needs access to all the data in memory at once. Hence it does not scale to datasets that do not fit in memory.
Instead you should use models that support incremental learning via the partial_fit method. For instance, some linear models such as sklearn.linear_model.SGDClassifier support partial_fit. You can slice your dataset and load it as a sequence of minibatches of shape (batch_size, n_features). batch_size can be 1, but that is inefficient because of the Python interpreter overhead (plus the data-loading overhead), so it is recommended to load samples in minibatches of at least 100.
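For illustration, here is a minimal sketch of that pattern. The iter_minibatches generator and its synthetic data are placeholders for whatever code reads your file in chunks of batch_size rows; only the SGDClassifier/partial_fit usage reflects the actual scikit-learn API.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)

# Placeholder for streaming data from disk: here we synthesize minibatches.
# In practice, each (X_batch, y_batch) would be read from your file.
def iter_minibatches(n_batches=50, batch_size=100, n_features=20):
    for _ in range(n_batches):
        X = rng.randn(batch_size, n_features)
        # Toy labels: sign of a fixed linear combination of two features.
        y = (X[:, 0] + X[:, 1] > 0).astype(int)
        yield X, y

# loss="hinge" makes SGDClassifier a linear SVM trained by stochastic gradient descent.
clf = SGDClassifier(loss="hinge")

# partial_fit must be told all possible class labels on the first call.
classes = np.array([0, 1])
for X_batch, y_batch in iter_minibatches():
    clf.partial_fit(X_batch, y_batch, classes=classes)

# Quick sanity check on a fresh minibatch.
X_test, y_test = next(iter_minibatches(n_batches=1))
print("accuracy on a held-out minibatch:", clf.score(X_test, y_test))
```

Note that this trains a linear SVM; if you need a non-linear (kernel) SVM on out-of-core data, a common workaround is to combine an explicit kernel approximation with such a linear incremental learner.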