


I have built a small program that creates a classifier for a given dataset with scikit-learn. Now I wanted to try it on an example to see the classifier at work. For example, the clf has to detect "cats".


This is how I proceed:


I have 50 pictures of cats and 50 pictures of "non-cats".

  1. Get descriptors for data_set with the SIFT feature detector
  2. Split the data into a training set and a test set (25 cat pictures + 25 non-cat pictures = training_set, same for test_set)
  3. Get cluster centers with k-means from the training_set
  4. Create histogram data for the training_set and test_set by using the cluster centers
  5. Try this code from scikit-learn:
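
Step 4 above (turning the descriptors of one picture into a histogram over the cluster centers) could be sketched as follows. This is a minimal plain-Python illustration, not the code used in this post: the helper names `nearest_center` and `bow_histogram` are hypothetical, and the toy 2-D points only stand in for real 128-dimensional SIFT descriptors.

```python
import math

def nearest_center(descriptor, centers):
    """Return the index of the cluster center closest to one descriptor."""
    best_index, best_dist = 0, math.inf
    for index, center in enumerate(centers):
        # squared Euclidean distance is enough for finding the minimum
        dist = sum((d - c) ** 2 for d, c in zip(descriptor, center))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index

def bow_histogram(descriptors, centers):
    """Normalized histogram of cluster assignments for one picture."""
    counts = [0] * len(centers)
    for descriptor in descriptors:
        counts[nearest_center(descriptor, centers)] += 1
    total = len(descriptors)
    if total == 0:
        return [0.0] * len(centers)  # picture without descriptors
    return [count / total for count in counts]

# Toy data (assumption): two cluster centers and four descriptors of one picture.
centers = [(0.0, 0.0), (10.0, 10.0)]
image_descriptors = [(1.0, 0.5), (9.0, 11.0), (0.2, 0.1), (0.0, 1.0)]
hist = bow_histogram(image_descriptors, centers)
print(hist)
```

Dividing by the number of descriptors keeps pictures with many and few keypoints comparable, which is the same idea as the `numpy.divide` call in the histogram function further down.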

# Tuning hyper-parameters for recall
/usr/local/lib/python2.7/dist-packages/sklearn/metrics/metrics.py:1760: UserWarning: The sum of true positives and false positives are equal to zero for some labels. Precision is ill defined for those labels [ 0.]. The precision and recall are equal to zero for some labels. fbeta_score is ill defined for those labels [ 0.].
/usr/local/lib/python2.7/dist-packages/sklearn/metrics/metrics.py:1760: UserWarning: The sum of true positives and false positives are equal to zero for some labels. Precision is ill defined for those labels [ 1.]. The precision and recall are equal to zero for some labels. fbeta_score is ill defined for those labels [ 1.].
Best parameters set found on development set:
SVC(C=0.001, cache_size=200, class_weight=None, coef0=0.0, degree=3,
  gamma=0.001, kernel=rbf, max_iter=-1, probability=False,
  random_state=None, shrinking=True, tol=0.001, verbose=False)
Grid scores on development set:
0.800 (+/-0.200) for {'kernel': 'rbf', 'C': 0.001, 'gamma': 0.001}
0.800 (+/-0.200) for {'kernel': 'rbf', 'C': 0.001, 'gamma': 0.0001}
0.800 (+/-0.200) for {'kernel': 'rbf', 'C': 0.01, 'gamma': 0.001}
0.800 (+/-0.200) for {'kernel': 'rbf', 'C': 0.01, 'gamma': 0.0001}
0.800 (+/-0.200) for {'kernel': 'rbf', 'C': 0.10000000000000001, 'gamma': 0.001}
0.800 (+/-0.200) for {'kernel': 'rbf', 'C': 0.10000000000000001, 'gamma': 0.0001}
0.800 (+/-0.200) for {'kernel': 'rbf', 'C': 1.0, 'gamma': 0.001}
0.800 (+/-0.200) for {'kernel': 'rbf', 'C': 1.0, 'gamma': 0.0001}
0.800 (+/-0.200) for {'kernel': 'rbf', 'C': 10.0, 'gamma': 0.001}
0.800 (+/-0.200) for {'kernel': 'rbf', 'C': 10.0, 'gamma': 0.0001}
0.800 (+/-0.200) for {'kernel': 'rbf', 'C': 100.0, 'gamma': 0.001}
0.800 (+/-0.200) for {'kernel': 'rbf', 'C': 100.0, 'gamma': 0.0001}
0.800 (+/-0.200) for {'kernel': 'rbf', 'C': 1000.0, 'gamma': 0.001}
0.800 (+/-0.200) for {'kernel': 'rbf', 'C': 1000.0, 'gamma': 0.0001}
Detailed classification report:
The model is trained on the full development set.
The scores are computed on the full evaluation set.
[ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
  1.  1.  1.  1.  1.  1.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
[ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
  1.  1.  1.  1.  1.  1.  1.  1.  0.  1.  1.  1.  1.  1.]
             precision    recall  f1-score   support

        0.0       1.00      0.04      0.08        25
        1.0       0.51      1.00      0.68        25

avg / total       0.76      0.52      0.38        50

{'kernel': 'rbf', 'C': 0.001, 'gamma': 0.001}


It seems that the clf says everything is a cat... but why?

Is the data_set too small to get a good result?
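
To make the "everything is a cat" reading of the report concrete, the confusion-matrix entries can be counted directly from the two arrays printed above. This is a small plain-Python sketch; the helper name `confusion_counts` is illustrative and not part of scikit-learn, and the arrays are rebuilt from the printed output (the single 0 in the predictions sits at index 44).

```python
# True labels from the first run: 25 cats (1) followed by 25 non-cats (0).
y_true = [1] * 25 + [0] * 25
# Predictions from the same run: all "cat" except the sample at index 44.
y_pred = [1] * 50
y_pred[44] = 0

def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for the given positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
precision_cat = tp / (tp + fp)      # precision for label 1.0
recall_non_cat = tn / (tn + fp)     # recall for label 0.0
print(tp, fp, fn, tn, round(precision_cat, 2), round(recall_non_cat, 2))
```

The counts reproduce the report: 24 of the 25 non-cats are predicted as cats, which gives the 0.51 precision for label 1.0 and the 0.04 recall for label 0.0.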


I'm using VLFeat to detect the SIFT descriptors:


import numpy
from sklearn.cluster import KMeans

def create_descriptor_data(data, ID):
    descriptor_list = []
    datas = numpy.genfromtxt(data, dtype='str')  # one picture path per line
    for p in datas:
        # create descriptors and save descs in file
        locs, desc = vlfeat_module.vlf_create_descriptors(p, str(ID) + '.key', ID)
        if len(desc) > 500:
            step = len(desc) // 400  # integer slice step, always >= 1 here
            desc = desc[::step]  # take between 400 - 800 descriptors
        descriptor_list.append(desc)  # keep the descriptors of this picture
        ID += 1  # ID for filename
    return descriptor_list

# create k-mean centers from the stacked descriptors (data)
def create_center_data(data):
    #data = numpy.vstack(data)
    n_clusters = len(numpy.unique(data))
    kmeans = KMeans(init='k-means++', n_clusters=n_clusters, n_init=1)
    kmeans.fit(data)  # the centers must be fitted before predict() is used
    return kmeans, n_clusters

def create_histogram_data(kmeans, descs, n_clusters):
    histogram_list = []
    # one histogram per picture
    for desc in descs:
        length = len(desc)
        # create histogram from descriptors
        histogram = kmeans.predict(desc)
        histogram = numpy.bincount(histogram, minlength=n_clusters)  # minlength = k in k-means
        histogram = numpy.divide(histogram, length, dtype='float')
        histogram_list.append(histogram)  # collect the histogram of this picture
    histogram = numpy.vstack(histogram_list)
    return histogram


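# imports for an older scikit-learn release (one that still provides grid_scores_)
import numpy
import lib.dataset_module
from sklearn.cross_validation import train_test_split
from sklearn.grid_search import GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score

X_desc_pos = lib.dataset_module.create_descriptor_data("./static/picture_set/dataset_pos.txt",0) # create desc from dataset_pos, 50 pics
X_desc_neg = lib.dataset_module.create_descriptor_data("./static/picture_set/dataset_neg.txt",51) # create desc from dataset_neg, 50 pics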

X_train_pos, X_test_pos = train_test_split(X_desc_pos, test_size=0.5)
X_train_neg, X_test_neg = train_test_split(X_desc_neg, test_size=0.5)

x1 = numpy.vstack(X_train_pos)
x2 = numpy.vstack(X_train_neg)
kmeans, n_clusters = lib.dataset_module.create_center_data(numpy.vstack((x1,x2)))

X_train_pos = lib.dataset_module.create_histogram_data(kmeans, X_train_pos, n_clusters)
X_train_neg = lib.dataset_module.create_histogram_data(kmeans, X_train_neg, n_clusters)

X_train = numpy.vstack([X_train_pos, X_train_neg])
y_train = numpy.hstack([numpy.ones(len(X_train_pos)), numpy.zeros(len(X_train_neg))])

X_test_pos = lib.dataset_module.create_histogram_data(kmeans, X_test_pos, n_clusters)
X_test_neg = lib.dataset_module.create_histogram_data(kmeans, X_test_neg, n_clusters)

X_test = numpy.vstack([X_test_pos, X_test_neg])
y_test = numpy.hstack([numpy.ones(len(X_test_pos)), numpy.zeros(len(X_test_neg))])

tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],
                     'C': [1, 10, 100, 1000]},
                    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]

scores = ['precision', 'recall']

for score in scores:
    print("# Tuning hyper-parameters for %s" % score)

    clf = GridSearchCV(SVC(C=1), tuned_parameters, cv=5, scoring=score)
    clf.fit(X_train, y_train)

    print("Best parameters set found on development set:")
    print(clf.best_estimator_)
    print("Grid scores on development set:")
    # cv_scores avoids shadowing the outer `scores` list
    for params, mean_score, cv_scores in clf.grid_scores_:
        print("%0.3f (+/-%0.03f) for %r"
              % (mean_score, cv_scores.std() / 2, params))
    print("Detailed classification report:")
    print("The model is trained on the full development set.")
    print("The scores are computed on the full evaluation set.")
    y_true, y_pred = y_test, clf.predict(X_test)
    print y_true
    print y_pred
    print(classification_report(y_true, y_pred))
    print clf.score(X_train, y_train)
    print "score"
    print clf.best_params_
    print "best_params"
    pred = clf.predict(X_test)
    print accuracy_score(y_test, pred)
    print "accuracy_score"


After some changes (updating the parameter range and using "accuracy" again as the score):

# Tuning hyper-parameters for accuracy
Best parameters set found on development set:
SVC(C=1000.0, cache_size=200, class_weight=None, coef0=0.0, degree=3,
  gamma=1.0, kernel=rbf, max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)
Grid scores on development set:
Detailed classification report:
The model is trained on the full development set.
The scores are computed on the full evaluation set.
[ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
  1.  1.  1.  1.  1.  1.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
[ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  0.  1.  0.  1.  1.  1.
  1.  1.  1.  0.  1.  1.  1.  0.  0.  0.  0.  0.  0.  0.  1.  0.  0.  0.
  0.  0.  0.  0.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.]
             precision    recall  f1-score   support

        0.0       0.88      0.92      0.90        25
        1.0       0.92      0.88      0.90        25

avg / total       0.90      0.90      0.90        50

{'kernel': 'rbf', 'C': 1000.0, 'gamma': 1.0}


But when I test it on a single picture with

rslt = clf.predict(test_histogram)


it still says to a sofa: "you're a cat" :D



There are several possible causes of this behaviour:

  • There is an error in the creation of the training/testing data [implementation error]
  • A training set of 20 elements (25 vectors with 5-fold cross-validation leaves 20 for training) can be too small for good generalization [underfitting]
  • The range of checked C and gamma values may be too narrow. These parameters are highly data dependent, and the values of your representation may require completely different C and gamma values than those currently used [under/overfitting]

My personal guess (as the issue is hard to reproduce without the data) is the third option: the tested C and gamma values are not suitable for finding a good model.



You should try much bigger ranges of values, e.g.:

  • C between 10^-5 and 10^15
  • gamma between 10^-14 and 10^2

C = [10.0**(i-5) for i in range(21)]       # 10^-5 ... 10^15
gamma = [10.0**(i-14) for i in range(17)]  # 10^-14 ... 10^2
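
As a rough sketch of what such a search costs, the ranges above can be put into the same list-of-dicts layout as the `tuned_parameters` in the question, and the number of candidate models counted. The fold count of 5 is taken from the `cv=5` in the question; everything else here is only an illustration.

```python
import itertools

# Exponent ranges as suggested above.
C = [10.0 ** exponent for exponent in range(-5, 16)]      # 1e-5 ... 1e15
gamma = [10.0 ** exponent for exponent in range(-14, 3)]  # 1e-14 ... 1e2

# Same layout as the tuned_parameters list used in the question.
tuned_parameters = [{'kernel': ['rbf'], 'C': C, 'gamma': gamma}]

# Every (C, gamma) pair is one candidate model; each is fitted once per fold.
n_candidates = len(list(itertools.product(C, gamma)))
n_folds = 5
n_fits = n_candidates * n_folds
print(n_candidates, n_fits)
```

With a data set this small, a few hundred candidates are still cheap to fit. For larger data sets it is common to start with a coarse grid like this one and then refine around the best region.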



Once the parameter ranges are corrected, you should perform the actual "case study". Gather more images, analyze your data representation (is a histogram really enough for this task?), preprocess your data (is it already normalized? Maybe try some decorrelation?), and consider using simpler kernels. RBF can be very deceptive: on one hand it can get great scores during training, but on the other it can fail completely during testing. This is a result of its capacity to overfit (for any consistent data set, an RBF-SVM can achieve a 100% score during training), so finding a balance between a model's power and its generalization ability is a hard problem. This is when the actual "machine learning journey" begins, have fun!
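
As one possible reading of the "is it already normalized?" question, the histogram features can be standardized to zero mean and unit variance per feature. The sketch below is plain Python with hypothetical helper names (`fit_standardizer`, `standardize`) and made-up toy values. The important detail is that the statistics come from the training set only and are then reused for the test set.

```python
import math

def fit_standardizer(rows):
    """Compute the per-feature mean and standard deviation on the training rows."""
    n = len(rows)
    n_features = len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(n_features)]
    stds = []
    for j in range(n_features):
        variance = sum((row[j] - means[j]) ** 2 for row in rows) / n
        # constant features get a scale of 1.0 to avoid dividing by zero
        stds.append(math.sqrt(variance) or 1.0)
    return means, stds

def standardize(rows, means, stds):
    """Apply previously fitted statistics to any set of rows."""
    return [[(value - m) / s for value, m, s in zip(row, means, stds)]
            for row in rows]

# Toy histograms (assumption): three training pictures, one test picture.
X_train = [[0.2, 0.8], [0.4, 0.6], [0.6, 0.4]]
X_test = [[0.4, 0.6]]

means, stds = fit_standardizer(X_train)
X_train_std = standardize(X_train, means, stds)
X_test_std = standardize(X_test, means, stds)  # reuse the training statistics
print(X_test_std)
```

scikit-learn provides the same fit-on-train, apply-to-test pattern through `sklearn.preprocessing.StandardScaler`. Whether standardization actually helps for these histograms is something to check on the data, not a given.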


