Support vector machines (SVMs) are a set of supervised learning methods used for classification, regression and outlier detection.
Support vector machines have the following advantages:
- Effective in high dimensional spaces.
- Still effective in cases where the number of dimensions is greater than the number of samples.
- Uses a subset of training points in the decision function (called support vectors), so it is also memory efficient.
- Versatile: different kernel functions can be specified for the decision function. Common kernels are provided, but it is also possible to specify custom kernels.
Support vector machines also have the following disadvantages:
- If the number of features is much greater than the number of samples, avoiding over-fitting in the choice of kernel functions and regularization term is crucial.
- SVMs do not directly provide probability estimates; these are calculated using an expensive five-fold cross-validation (see Scores and probabilities, below).
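A minimal sketch of the probability point above (the toy data is invented for illustration): passing probability=True makes SVC run the internal five-fold cross-validation so that predict_proba becomes available.

```python
# A minimal sketch: probability estimates require probability=True,
# which triggers the extra five-fold cross-validation mentioned above.
# The toy data here is invented for illustration.
from sklearn import svm

# five samples per class, so the internal 5-fold CV has enough data
X = [[float(i), float(i)] for i in range(10)]
y = [0] * 5 + [1] * 5

clf = svm.SVC(gamma='scale', probability=True)
clf.fit(X, y)

proba = clf.predict_proba([[1.0, 1.0]])  # one row per sample, one column per class
print(proba.shape)
```

Note that fitting with probability=True is noticeably slower than a plain fit, precisely because of this extra cross-validation step.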
The support vector machines in sklearn support both dense and sparse sample vectors as input. However, to use an SVM to make predictions on sparse data, it must have been fitted on sparse data as well.
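A minimal sketch of that note, assuming scipy is available: the same SVC estimator accepts a scipy.sparse matrix, and prediction on sparse input follows training on sparse input.

```python
# A minimal sketch: SVC accepts scipy.sparse matrices directly.
# As noted above, when training on sparse data, predict on sparse data too.
import numpy as np
from scipy import sparse
from sklearn import svm

X_dense = np.array([[0, 1], [1, 0], [1, 1], [0, 0]], dtype=float)
y = [0, 1, 1, 0]

X_sparse = sparse.csr_matrix(X_dense)  # Compressed Sparse Row format

clf = svm.SVC(gamma='scale')
clf.fit(X_sparse, y)                                 # train on sparse data
pred = clf.predict(sparse.csr_matrix([[1.0, 1.0]]))  # predict on sparse data as well
print(pred)
```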
Classification
SVC, NuSVC and LinearSVC are classes capable of performing multi-class classification on a dataset.
Note: LinearSVC does not accept the keyword kernel, as it is assumed to be linear. It also lacks some of the attributes of SVC and NuSVC, such as support_.
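A minimal sketch of that note (toy data invented for illustration): LinearSVC is constructed without any kernel keyword, and after fitting it exposes no support-vector attributes.

```python
# A minimal sketch: LinearSVC uses the same (X, y) fit interface, but has
# no `kernel` keyword and does not expose `support_` or `support_vectors_`.
from sklearn.svm import LinearSVC

X = [[0, 0], [1, 1], [2, 2], [3, 3]]
y = [0, 0, 1, 1]

clf = LinearSVC(max_iter=10000)  # no kernel argument: the model is linear
clf.fit(X, y)

print(hasattr(clf, 'support_'))  # False: no support-vector attributes here
```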
Methods
SVC, NuSVC and LinearSVC all take the same two inputs for training: an array X of shape [n_samples, n_features] holding the sample features, and an array y of shape [n_samples] holding the sample labels.
A quick test
>>> from sklearn import svm
>>> X = [[0, 1], [1, 0]]
>>> y = [0, 1]
>>> clf = svm.SVC(gamma='scale')
>>> clf.fit(X, y)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
- Prediction is just as simple:
>>> data = [[0 for i in range(2)] for j in range(2)]
>>> data
[[0, 0], [0, 0]]
>>> for i in range(2):
... for j in range(2):
... data[i][j] = clf.predict([[i , j]])[0]
...
>>> data
[[1, 0], [1, 1]]
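After fitting, the support vectors behind the decision function can be inspected directly, which illustrates the memory-efficiency point from the advantages list. A minimal sketch on the same two-sample data:

```python
# A minimal sketch: a fitted SVC exposes the subset of training points
# (the support vectors) that its decision function actually uses.
from sklearn import svm

X = [[0, 1], [1, 0]]
y = [0, 1]

clf = svm.SVC(gamma='scale')
clf.fit(X, y)

print(clf.support_vectors_)  # the training points kept as support vectors
print(clf.support_)          # their indices into X
print(clf.n_support_)        # number of support vectors per class
```

With only two training points, one per class, both necessarily become support vectors.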
Multi-class classification
SVC and NuSVC implement the "one-against-one" approach for multi-class classification. If n_class is the number of classes, then n_class * (n_class - 1) / 2 classifiers are constructed, each trained to distinguish between a different pair of classes.
To provide a consistent interface with other classifiers, the decision_function_shape option allows aggregating the results of the "one-against-one" classifiers into a decision function of shape (n_samples, n_classes):
- For example:
>>> X = [[0], [1], [2], [3]]
>>> Y = [0, 1, 2, 3]
>>> clf = svm.SVC(gamma='scale', decision_function_shape='ovo')
>>> clf.fit(X, Y)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovo', degree=3, gamma='scale', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
>>> dec = clf.decision_function([[1]])
>>> dec.shape[1]
6
>>> clf.decision_function_shape = 'ovr'
>>> dec = clf.decision_function([[1]])
>>> dec.shape[1]
4
>>> dec
array([[ 1.95120255, 3.5 , 0.95120255, -0.4024051 ]])
>>> clf
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
Unbalanced problems
In problems where it is desired to give more importance to certain classes or to certain individual samples, the keywords class_weight and sample_weight can be used.
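A minimal sketch of both keywords (toy data invented for illustration): class_weight rescales the penalty parameter C for an entire class at construction time, while sample_weight rescales it per sample at fit time.

```python
# A minimal sketch: two ways to give certain classes or samples more weight.
from sklearn import svm

X = [[0, 0], [1, 1], [1, 0], [0, 1]]
y = [0, 0, 1, 1]

# class_weight: penalize mistakes on class 1 ten times more heavily.
wclf = svm.SVC(gamma='scale', class_weight={1: 10})
wclf.fit(X, y)

# sample_weight: weight individual samples when fitting instead.
clf = svm.SVC(gamma='scale')
clf.fit(X, y, sample_weight=[1.0, 1.0, 10.0, 10.0])
```

class_weight is a constructor parameter, while sample_weight is passed to fit, so the latter can change between refits of the same estimator.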