CS231n 2016 Walkthrough, Chapter 2 - KNN Assignment Analysis

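KNN assignment requirements:

1. Understand how the KNN algorithm works

2. Implement KNN for a given value of K

3. Implement cross-validation over K

1. For the principles behind KNN, see the previous section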

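2. Implementing KNN

　　The process has two steps:

　　　　1. Compute the distances between the test set and the training set

　　　　2. Decide the final label by comparing how often each label appears among the nearest neighbors

　　Code analysis:

　　cell1 - cell5 preprocess the data

　　cell6 creates the KNN class and initializes its variables; here the training data and their labels are passed in and stored

　　cell7 implements the KNN algorithm with two loops:

　　　　It computes distances one test vector at a time against the rows of the training matrix (in an earlier cell each image was already flattened into a row vector: a 32*32*3 image becomes 1*3072,

　　　　so the test set of 500 images is 500*3072 and the training set of 5000 images is 5000*3072)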
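The flattening step described above can be sketched as follows. This is only an illustration: the array `images` is random stand-in data made up for the example (the real notebook loads CIFAR-10), and only 5 images are used to keep it light.

```python
import numpy as np

# Hypothetical stand-in data: 5 random "images" of shape 32x32x3.
# The real notebook loads CIFAR-10 here instead.
rng = np.random.RandomState(0)
images = rng.randint(0, 256, size=(5, 32, 32, 3)).astype(np.float64)

# Flatten every image into a single row: (5, 32, 32, 3) -> (5, 3072).
rows = np.reshape(images, (images.shape[0], -1))

# Each row now holds 32 * 32 * 3 = 3072 pixel values.
num_images, num_features = rows.shape
```

With 500 test images and 5000 training images, the same reshape would give the 500*3072 and 5000*3072 matrices mentioned above.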

　　Code environment: python 2.7.9 + numpy 1.11.0

　　Tip: use help to look up how a function is used, or search google

　　　　Example: np.square

　　　　　　Press q to exit help

　　　　　　The help output shows that np.square() is written in C for speed, and its detailed usage cannot be found there. Searching google for the documentation instead:

　　　　　　　　The documented example squares each element of the array [-1j, 1], and the result is [-1, 1]
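A quick way to try this out yourself is the small sketch below. It only uses np.square and the standard `__doc__` attribute; the variable names are made up for the example.

```python
import numpy as np

# Element-wise square of an array containing a complex number.
# (-1j) ** 2 is -1 and 1 ** 2 is 1, so the result is [-1, 1] (stored as complex).
squares = np.square([-1j, 1])

# The same call on ordinary real numbers.
real_squares = np.square(np.array([1.0, -2.0, 3.0]))

# The docstring is also available as a plain string,
# which can be handy when help() is not informative.
doc = np.square.__doc__
```

Here `real_squares` holds 1, 4 and 9, which is exactly the operation needed inside the L2 distance.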

　　Code: implementing compute_distances_two_loops(self, X)

  def compute_distances_two_loops(self, X):
    """
    Compute the distance between each test point in X and each training point
    in self.X_train using a nested loop over both the training data and the
    test data.

    Inputs:
    - X: A numpy array of shape (num_test, D) containing test data.

    Returns:
    - dists: A numpy array of shape (num_test, num_train) where dists[i, j]
      is the Euclidean distance between the ith test point and the jth training
      point.
    """
    num_test = X.shape[0]
    num_train = self.X_train.shape[0]
    dists = np.zeros((num_test, num_train))
    for i in xrange(num_test):
      for j in xrange(num_train):
        #####################################################################
        # TODO:                                                             #
        # Compute the l2 distance between the ith test point and the jth    #
        # training point, and store the result in dists[i, j]. You should   #
        # not use a loop over dimension.                                    #
        #####################################################################
        dists[i,j] = np.sqrt(np.sum(np.square(X[i,:]-self.X_train[j,:])))
        #####################################################################
        #                       END OF YOUR CODE                            #
        #####################################################################
    return dists

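　　　　This computes the L2 distance between the vector of one test image and the vector of one training image.

　　　　The numpy.linalg.norm function can be used instead:

　　　　　　For a vector $x$, this function computes by default: $\|x\|_2 = \sqrt{\sum_i |x_i|^2}$

　　　　　　So the core line can be written as:

　　　　　　　　dists[i,j] = np.linalg.norm(self.X_train[j,:]-X[i,:])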
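To convince yourself that the two forms agree, a small check like the one below may help. The vectors `a` and `b` are random stand-ins made up for the example, not data from the assignment.

```python
import numpy as np

rng = np.random.RandomState(0)
a = rng.rand(3072)  # hypothetical flattened test image
b = rng.rand(3072)  # hypothetical flattened training image

# Explicit form: square the differences, sum them, take the square root.
d_manual = np.sqrt(np.sum(np.square(a - b)))

# Same quantity via the default 2-norm of the difference vector.
# The order of subtraction does not matter because the differences are squared.
d_norm = np.linalg.norm(b - a)

# A tiny case that is easy to verify by hand: sqrt(3^2 + 4^2) = 5.
d_345 = np.linalg.norm(np.array([3.0, 4.0]))
```

Both `d_manual` and `d_norm` should match up to floating-point rounding, so either line can be used inside the double loop.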

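　　cell8 visualizes the resulting distances: white indicates larger distance values, black indicates smaller ones

　　cell9 implements label prediction for K=1

　　Code: implementing classifier.predict_labels()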

  # Requires at module level: from collections import Counter
  def predict_labels(self, dists, k=1):
    """
    Given a matrix of distances between test points and training points,
    predict a label for each test point.

    Inputs:
    - dists: A numpy array of shape (num_test, num_train) where dists[i, j]
      gives the distance between the ith test point and the jth training point.

    Returns:
    - y: A numpy array of shape (num_test,) containing predicted labels for the
      test data, where y[i] is the predicted label for the test point X[i].
    """
    num_test = dists.shape[0]
    y_pred = np.zeros(num_test)
    for i in xrange(num_test):
      # A list of length k storing the labels of the k nearest neighbors to
      # the ith test point.
      closest_y = []
      count = []
      #########################################################################
      # TODO:                                                                 #
      # Use the distance matrix to find the k nearest neighbors of the ith    #
      # testing point, and use self.y_train to find the labels of these       #
      # neighbors. Store these labels in closest_y.                           #
      # Hint: Look up the function numpy.argsort.                             #
      #########################################################################
      # Sort training labels by distance to test point i, then keep the first k.
      buf_labels = self.y_train[np.argsort(dists[i,:])]
      closest_y = buf_labels[0:k]
      #########################################################################
      # TODO:                                                                 #
      # Now that you have found the labels of the k nearest neighbors, you    #
      # need to find the most common label in the list closest_y of labels.   #
      # Store this label in y_pred[i]. Break ties by choosing the smaller     #
      # label.                                                                #
      #########################################################################
      # Earlier manual-counting attempt, kept for reference
      # (it assumes closest_y is a Python list, not a numpy array):
      #for j in closest_y :
      #  count.append(closest_y.count(j))
      #m = max(count)
      #n = count.index(m)
      #y_pred[i] = closest_y[n]
      # Note: most_common does not guarantee that the smaller label wins a tie.
      c = Counter(closest_y)
      y_pred[i] = c.most_common(1)[0][0]
      #########################################################################
      #                           END OF YOUR CODE                            #
      #########################################################################
    return y_pred

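　　　　　　　　Steps:

　　　　　　　　　　1. Use numpy.argsort to sort all the distances and get the indices in sorted order.

　　　　　　　　　　2. Use those indices to look up the corresponding labels

　　　　　　　　　　3. Use Counter from the collections package to tally the labels

　　　　　　　　　　4. Use the Counter's most_common method to get the label that appears most often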
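The four steps above can be traced on a toy example. Everything in this sketch is made-up illustration data (`dists_row`, `y_train`, `k`), and the final np.bincount variant is my own suggestion rather than part of the original solution: it assumes labels are non-negative integers (as with class ids 0-9) and picks the smaller label on a tie, which is what the assignment's TODO asks for.

```python
from collections import Counter
import numpy as np

# Hypothetical toy data: distances from one test point to six training points.
dists_row = np.array([0.9, 0.1, 0.5, 0.3, 0.7, 0.2])
y_train = np.array([2, 1, 0, 1, 2, 0])
k = 3

# Step 1: indices of the training points, ordered from nearest to farthest.
order = np.argsort(dists_row)

# Step 2: labels of the k nearest neighbors.
closest_y = y_train[order][:k]

# Steps 3 and 4: tally the labels and take the most frequent one.
label_counter = Counter(closest_y.tolist())
predicted = label_counter.most_common(1)[0][0]

# Tie-safe variant (an assumption of this sketch): bincount counts each label,
# and argmax returns the first, i.e. smallest, label among equal counts.
predicted_tie_safe = np.bincount(closest_y).argmax()

# With a genuine tie (one vote each for labels 1 and 0), label 0 is chosen.
tie_winner = np.bincount(np.array([1, 0])).argmax()
```

Here the three nearest neighbors carry labels 1, 0 and 1, so both approaches predict label 1; they can differ only when two labels are tied.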

　　Once the prediction is done, cell9 also computes the accuracy
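The accuracy computation amounts to counting how many predictions match the true labels. A minimal sketch, with made-up `y_test` and `y_pred` arrays standing in for the notebook's real data:

```python
import numpy as np

# Hypothetical ground-truth labels and predictions for five test images.
y_test = np.array([3, 8, 8, 0, 6])
y_pred = np.array([3, 8, 1, 0, 4])

# Element-wise comparison gives a boolean array; summing it counts the matches.
num_correct = np.sum(y_pred == y_test)

# float() keeps the division exact under Python 2 as well as Python 3.
accuracy = float(num_correct) / len(y_test)
```

In this toy case 3 of the 5 predictions match, giving an accuracy of 0.6.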