Are high values for c or gamma problematic when using an RBF kernel SVM?

Question

I'm using WEKA/LibSVM to train a classifier for a term extraction system. My data is not linearly separable, so I used an RBF kernel instead of a linear one.
I followed the guide from Hsu et al. and iterated over several values for both c and gamma. The parameters which worked best for classifying known terms (test and training material differ of course) are rather high, c=2^10 and gamma=2^3.
So far the high parameters seem to work ok, yet I wonder if they may cause any problems further on, especially regarding overfitting. I plan to do another evaluation by extracting new terms, yet those are costly as I need human judges.
Could anything still be wrong with my parameters, even if both evaluations turn out positive? Do I perhaps need another kernel type?

Thanks a lot!

Recommended answer

In general, you have to perform cross-validation to determine whether the parameters are all right or whether they lead to overfitting.
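As a rough illustration of what such a cross-validated parameter search looks like, here is a minimal, library-independent sketch. The grid bounds are the coarse exponential grid suggested in the Hsu et al. guide; `k_fold_indices`, `grid_search`, and the `evaluate` callback are hypothetical names introduced here, and `evaluate` stands in for whatever code trains the RBF SVM (for example through LibSVM) and scores it on the held-out fold.

```python
import random
from statistics import mean


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most 1.
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def grid_search(evaluate, n_samples, k=5):
    """Return (best_score, C, gamma) ranked by mean cross-validated score.

    `evaluate(C, gamma, train_idx, test_idx)` is a hypothetical callback that
    trains an RBF SVM on train_idx and returns its score on test_idx.
    """
    best = None
    for log_c in range(-5, 16, 2):        # C = 2^-5, 2^-3, ..., 2^15
        for log_g in range(-15, 4, 2):    # gamma = 2^-15, 2^-13, ..., 2^3
            C, gamma = 2.0 ** log_c, 2.0 ** log_g
            score = mean(
                evaluate(C, gamma, train, test)
                for train, test in k_fold_indices(n_samples, k)
            )
            if best is None or score > best[0]:
                best = (score, C, gamma)
    return best
```

A large gap between the training score and the mean held-out score for the chosen pair is the usual sign of overfitting. Note that gamma = 2^3 sits at the upper edge of this grid, so it may be worth checking whether neighbouring, smaller values score almost as well.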

From an intuitive perspective, this looks like a highly overfitted model. A high value of gamma means that your Gaussians are very narrow (concentrated around each point), which, combined with a high C value, will result in memorizing most of the training set. If you check the number of support vectors, I would not be surprised if it were 50% of your whole data set. Another possible explanation is that you did not scale your data. Most ML methods, especially SVMs, require the data to be properly preprocessed. In particular, this means that you should normalize (standardize) the input data so that it is more or less contained in the unit sphere.
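To make the point about narrow Gaussians and scaling concrete, here is a small sketch using only the standard library. It evaluates the RBF kernel K(x, y) = exp(-gamma * ||x - y||^2) with gamma = 2^3 for two nearby points, first on raw features and then after standardization. The feature values (a frequency and a length per term) are made up for illustration, and `rbf_kernel` and `standardize` are hypothetical helper names, not LibSVM or WEKA functions.

```python
import math


def rbf_kernel(x, y, gamma):
    """RBF kernel K(x, y) = exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def standardize(rows):
    """Scale each column to zero mean and unit (population) standard deviation."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    # Fall back to 1.0 for constant columns to avoid division by zero.
    stds = [
        math.sqrt(sum((v - m) ** 2 for v in c) / n) or 1.0
        for c, m in zip(cols, means)
    ]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


gamma = 2 ** 3  # the value reported in the question

# Made-up raw features per term: [corpus frequency, length in characters]
raw = [[120.0, 5.0], [130.0, 6.0], [400.0, 12.0], [410.0, 11.0]]
scaled = standardize(raw)

# Similarity between the first two (intuitively close) points
k_raw = rbf_kernel(raw[0], raw[1], gamma)
k_scaled = rbf_kernel(scaled[0], scaled[1], gamma)
print(f"raw: {k_raw:.3g}, standardized: {k_scaled:.3g}")
```

On the raw features the kernel value between the two neighbouring points is effectively zero, so every training point is similar only to itself and the model can do little more than memorize. After standardization the same pair has a clearly non-zero similarity, and a smaller gamma may turn out to be sufficient.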


09-15 04:51