Support vector machines (SVMs) are a set of supervised learning methods used for classification, regression and outlier detection.
Support vector machines have the following advantages:
- Effective in high dimensional spaces.
- Still effective in cases where the number of dimensions is greater than the number of samples.
- Uses a subset of training points in the decision function (called support vectors), so it is also memory efficient.
- Versatile: different kernel functions can be specified for the decision function. Common kernels are provided, but it is also possible to specify custom kernels.
Support vector machines also have the following disadvantages:
- If the number of features is much greater than the number of samples, avoiding over-fitting in the choice of kernel functions and regularization term is crucial.
- SVMs do not directly provide probability estimates; these are calculated using an expensive five-fold cross-validation (see Scores and probabilities, below).
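A minimal sketch of the probability point above (the toy data is invented for illustration): passing probability=True makes SVC run the internal five-fold cross-validation so that predict_proba becomes available.

```python
# A minimal sketch: probability estimates require probability=True,
# which triggers the extra five-fold cross-validation mentioned above.
# The toy data here is invented for illustration.
from sklearn import svm

# five samples per class, so the internal 5-fold CV has enough data
X = [[float(i), float(i)] for i in range(10)]
y = [0] * 5 + [1] * 5

clf = svm.SVC(gamma='scale', probability=True)
clf.fit(X, y)

proba = clf.predict_proba([[1.0, 1.0]])  # one row per sample, one column per class
print(proba.shape)
```

Note that fitting with probability=True is noticeably slower than a plain fit, precisely because of this extra cross-validation step.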
The support vector machines in sklearn support both dense and sparse sample vectors as input. However, to use an SVM to make predictions on sparse data, it must have been fitted on sparse data as well.
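A minimal sketch of that note, assuming scipy is available: the same SVC estimator accepts a scipy.sparse matrix, and prediction on sparse input follows training on sparse input.

```python
# A minimal sketch: SVC accepts scipy.sparse matrices directly.
# As noted above, when training on sparse data, predict on sparse data too.
import numpy as np
from scipy import sparse
from sklearn import svm

X_dense = np.array([[0, 1], [1, 0], [1, 1], [0, 0]], dtype=float)
y = [0, 1, 1, 0]

X_sparse = sparse.csr_matrix(X_dense)  # Compressed Sparse Row format

clf = svm.SVC(gamma='scale')
clf.fit(X_sparse, y)                                 # train on sparse data
pred = clf.predict(sparse.csr_matrix([[1.0, 1.0]]))  # predict on sparse data as well
print(pred)
```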
Classification
SVC, NuSVC and LinearSVC are classes capable of performing multi-class classification on a dataset.
Note: LinearSVC does not accept the keyword kernel, as it is assumed to be linear. It also lacks some of the attributes of SVC and NuSVC, such as support_.
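A minimal sketch of that note (toy data invented for illustration): LinearSVC is constructed without any kernel keyword, and after fitting it exposes no support-vector attributes.

```python
# A minimal sketch: LinearSVC uses the same (X, y) fit interface, but has
# no `kernel` keyword and does not expose `support_` or `support_vectors_`.
from sklearn.svm import LinearSVC

X = [[0, 0], [1, 1], [2, 2], [3, 3]]
y = [0, 0, 1, 1]

clf = LinearSVC(max_iter=10000)  # no kernel argument: the model is linear
clf.fit(X, y)

print(hasattr(clf, 'support_'))  # False: no support-vector attributes here
```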
Methods
SVC, NuSVC and LinearSVC all take the same two inputs for training: an array X of shape [n_samples, n_features] holding the sample features, and an array y of shape [n_samples] holding the sample labels.
A quick test
>>> from sklearn import svm
>>> X = [[0, 1], [1, 0]]
>>> y = [0, 1]
>>> clf = svm.SVC(gamma='scale')
>>> clf.fit(X, y)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
- Prediction is just as simple:
>>> data = [[0 for i in range(2)] for j in range(2)]
>>> data
[[0, 0], [0, 0]]
>>> for i in range(2):
... for j in range(2):
... data[i][j] = clf.predict([[i , j]])[0]
...
>>> data
[[1, 0], [1, 1]]
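After fitting, the support vectors behind the decision function can be inspected directly, which illustrates the memory-efficiency point from the advantages list. A minimal sketch on the same two-sample data:

```python
# A minimal sketch: a fitted SVC exposes the subset of training points
# (the support vectors) that its decision function actually uses.
from sklearn import svm

X = [[0, 1], [1, 0]]
y = [0, 1]

clf = svm.SVC(gamma='scale')
clf.fit(X, y)

print(clf.support_vectors_)  # the training points kept as support vectors
print(clf.support_)          # their indices into X
print(clf.n_support_)        # number of support vectors per class
```

With only two training points, one per class, both necessarily become support vectors.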
Multi-class classification
SVC and NuSVC implement the "one-against-one" approach for multi-class classification. If n_class is the number of classes, then n_class * (n_class - 1) / 2 classifiers are constructed, each trained to distinguish between a different pair of classes.
To provide a consistent interface with other classifiers, the decision_function_shape option allows aggregating the results of the "one-against-one" classifiers into a decision function of shape (n_samples, n_classes):
- For example:
>>> X = [[0], [1], [2], [3]]
>>> Y = [0, 1, 2, 3]
>>> clf = svm.SVC(gamma='scale', decision_function_shape='ovo')
>>> clf.fit(X, Y)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovo', degree=3, gamma='scale', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
>>> dec = clf.decision_function([[1]])
>>> dec.shape[1]
6
>>> clf.decision_function_shape = 'ovr'
>>> dec = clf.decision_function([[1]])
>>> dec.shape[1]
4
>>> dec
array([[ 1.95120255, 3.5 , 0.95120255, -0.4024051 ]])
>>> clf
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
Unbalanced problems
In problems where it is desired to give more importance to certain classes or to certain individual samples, the keywords class_weight and sample_weight can be used.
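A minimal sketch of both keywords (toy data invented for illustration): class_weight rescales the penalty parameter C for an entire class at construction time, while sample_weight rescales it per sample at fit time.

```python
# A minimal sketch: two ways to give certain classes or samples more weight.
from sklearn import svm

X = [[0, 0], [1, 1], [1, 0], [0, 1]]
y = [0, 0, 1, 1]

# class_weight: penalize mistakes on class 1 ten times more heavily.
wclf = svm.SVC(gamma='scale', class_weight={1: 10})
wclf.fit(X, y)

# sample_weight: weight individual samples when fitting instead.
clf = svm.SVC(gamma='scale')
clf.fit(X, y, sample_weight=[1.0, 1.0, 10.0, 10.0])
```

class_weight is a constructor parameter, while sample_weight is passed to fit, so the latter can change between refits of the same estimator.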