Problem description
I'm trying to train an SVM classifier on a large number of items and classes, and it becomes really, really slow.
First of all, I've extracted a feature set from my data, 512 features in total, and put it in a numpy array. There are 13k items in this array. It looks like this:
>>print(type(X_train))
<class 'numpy.ndarray'>
>>print(X_train)
[[ 0.01988654 -0.02607637 0.04691431 ... 0.11521499 0.03433102
0.01791015]
[-0.00058317 0.05720023 0.03854145 ... 0.07057668 0.09192026
0.01479562]
[ 0.01506544 0.05616265 0.01514515 ... 0.04981219 0.05810429
0.00232013]
...
Also, there are ~4k different classes:
>> print(type(labels))
<class 'list'>
>> print(labels)
[0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 4, 5, 5, ... ]
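Here is the classifier:
import pickle
from thundersvmScikit import SVC

FILENAME = 'dataset.pickle'

# Load the feature matrix and the matching list of integer labels
with open(FILENAME, 'rb') as infile:
    (X_train, labels) = pickle.load(infile)

# Linear-kernel SVM; probability=True requests probability estimates
clf = SVC(kernel='linear', probability=True)
clf.fit(X_train, labels)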
After ~90 hours (and I'm using a GPU implementation of the scikit-learn SVM interface, in the form of thundersvm), the fit operation is still running. Given that this is a pretty small dataset, I definitely need something more efficient, but I haven't had much success so far. For example, I've tried this Keras model:
from keras.models import Sequential
from keras.layers import Dense, Dropout

# n_classes is the number of distinct labels (~4k here)
model = Sequential()
model.add(Dense(input_dim=512, units=100, activation='tanh'))
model.add(Dropout(0.2))
model.add(Dense(units=n_classes, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adadelta', metrics=['accuracy'])
model.fit(X_train, labels, epochs=500, batch_size=64, validation_split=0.1, shuffle=True)
I end up with pretty good accuracy during the training stage:
Epoch 500/500
11988/11988 [==============================] - 1s 111us/step - loss: 2.1398 - acc: 0.8972 - val_loss: 9.5077 - val_acc: 0.0000e+00
However, during actual testing, even on data that was present in the training dataset, I got extremely low accuracy, with the model predicting essentially random classes:
Predictions (best probabilities):
0 class710015: 0.008
1 class715573: 0.007
2 class726619: 0.006
3 class726619: 0.010
4 class720439: 0.007
Accuracy: 0.000
Could you please point me in the right direction? Should I adjust the SVM approach somehow, or should I switch to a custom Keras model for this type of problem? If so, what might be wrong with my model?
Thanks a lot.
Recommended answer
SVM is most natural for binary classification. For multiclass problems, scikit-learn's SVC uses a one-versus-one scheme that combines K(K-1)/2, i.e. O(K^2), binary classifiers (https://scikit-learn.org/stable/modules/svm.html), where K is the number of classes. The running time therefore grows with K^2, which in your case is about 16 million (roughly 8 million pairwise classifiers). This is why it is so slow.
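To make the scaling concrete, here is a small back-of-the-envelope sketch that counts the binary classifiers each multiclass scheme would need. The helper names are made up for illustration, and K = 4000 is only the approximate class count given in the question.

```python
def ovo_classifier_count(n_classes: int) -> int:
    """Pairwise (one-vs-one) scheme: one binary classifier per pair of classes."""
    return n_classes * (n_classes - 1) // 2


def ovr_classifier_count(n_classes: int) -> int:
    """One-vs-rest scheme: one binary classifier per class."""
    return n_classes


K = 4000  # approximate number of classes from the question

ovo = ovo_classifier_count(K)
ovr = ovr_classifier_count(K)

print(f"one-vs-one : {ovo:,} binary classifiers")   # 7,998,000
print(f"one-vs-rest: {ovr:,} binary classifiers")   # 4,000
```

Each pairwise problem is small, since it only sees the samples of two classes, but the sheer number of them is what dominates here.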
You should either reduce the number of classes or switch to another model, such as a neural network or a decision tree.
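If you stay with the Keras model from the question, one thing that may be worth checking is the validation split. Keras documents that `validation_split` takes the last fraction of the data, before any shuffling. If the labels are sorted by class, as the printed list in the question suggests, the held-out samples would belong to classes the network never sees during training, which would be consistent with `val_acc` staying at 0. The sketch below uses a made-up, sorted label array (1000 classes with 3 samples each) as a stand-in for the real data, and `tail_split_unseen` is a hypothetical helper, not a Keras function.

```python
import numpy as np

# Hypothetical stand-in for the question's labels: sorted by class, 3 samples each
labels = np.repeat(np.arange(1000), 3)


def tail_split_unseen(labels, validation_split=0.1):
    """Return the classes that occur only in the last `validation_split` fraction."""
    split_at = int(len(labels) * (1.0 - validation_split))
    train_classes = set(labels[:split_at].tolist())
    val_classes = set(labels[split_at:].tolist())
    return val_classes - train_classes


# Sorted labels: every class in the held-out tail is missing from the training part
unseen_sorted = tail_split_unseen(labels)

# Shuffling once before the split spreads each class across both parts
rng = np.random.default_rng(0)
perm = rng.permutation(len(labels))
unseen_shuffled = tail_split_unseen(labels[perm])

print("unseen validation classes (sorted)  :", len(unseen_sorted))
print("unseen validation classes (shuffled):", len(unseen_shuffled))
```

With the real data, the same permutation would have to be applied to `X_train` and to the labels before calling `fit`. This is only a guess at one contributing factor; it would not by itself explain poor predictions on samples that were in the training set.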
P.S.: scikit-learn also offers a one-vs-all (one-vs-rest) approach for SVM (https://scikit-learn.org/stable/modules/multiclass.html), which needs only O(K) binary classifiers. You could try that as well.
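A minimal sketch of the one-vs-rest route, assuming plain scikit-learn rather than thundersvm: `LinearSVC` trains a linear SVM and uses a one-vs-rest scheme for multiclass problems by default. The data below is synthetic (20 classes, 5 samples each, 32 features) and only stands in for the real 13k x 512 feature matrix.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Synthetic stand-in data: one noisy cluster per class
rng = np.random.default_rng(0)
n_classes, per_class, n_features = 20, 5, 32
centers = rng.normal(size=(n_classes, n_features))
noise = 0.05 * rng.normal(size=(n_classes * per_class, n_features))
X = np.repeat(centers, per_class, axis=0) + noise
y = np.repeat(np.arange(n_classes), per_class)

# LinearSVC fits one binary linear classifier per class (one-vs-rest)
clf = LinearSVC(max_iter=5000)
clf.fit(X, y)

# One weight vector per class, i.e. K classifiers instead of K(K-1)/2
print(clf.coef_.shape)

# Per-class scores; the highest score gives the predicted class
scores = clf.decision_function(X[:3])
print(scores.shape)
```

Note that `LinearSVC` has no `predict_proba`, so unlike `SVC(probability=True)` it yields decision scores rather than probabilities. The linked multiclass page also describes `OneVsRestClassifier`, which can wrap other binary estimators in the same way.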