k-means: principle and Python implementation

k-means is an unsupervised clustering algorithm.

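k is the number of clusters the data is partitioned into; "means" refers to the mean, which is the rule used to update the cluster centers at each iteration.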

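The k-means algorithm:

(1) Randomly choose k initial cluster centers (usually drawn from the sample set, although arbitrary random points also work);

(2) Compute the distance from every sample to each of the k centers, and assign the sample to the cluster whose center is closest;

(3) Update the centers: for each of the k clusters, compute the mean of the samples assigned to it and use that mean as the new center;

(4) Repeat (2) and (3) until the cluster centers stop changing, or the change in center positions falls below a threshold, or a maximum number of iterations is reached.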
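
The four steps above can be sketched with plain Euclidean distance as follows. This is a minimal illustration, not the anchor-box implementation from this post: the function name `kmeans_euclidean` and the parameters `max_iter`, `tol` and `seed` are assumptions introduced for the example, and the sample data is made up.

```python
import numpy as np


def kmeans_euclidean(points, k, max_iter=100, tol=1e-6, seed=0):
    """Plain k-means with Euclidean distance (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    # Step 1: pick k distinct samples as the initial centers.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each sample to its nearest center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each center to the mean of its samples.
        new_centers = centers.copy()
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old center if a cluster is empty
                new_centers[j] = members.mean(axis=0)
        # Step 4: stop once the centers move less than tol.
        shift = np.linalg.norm(new_centers - centers)
        centers = new_centers
        if shift < tol:
            break
    # Recompute the labels so they match the returned centers.
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return centers, dists.argmin(axis=1)


# Made-up data: two well-separated groups of three points each.
data = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                 [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centers, labels = kmeans_euclidean(data, k=2)
```

On this toy data the first three points end up in one cluster and the last three in the other. Which cluster gets index 0 depends on the random initialization.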

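Python implementation (this version clusters box widths and heights with 1 - IoU as the distance, to match how YOLOv3 anchor boxes are chosen):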

import numpy as np


def iou(box, clusters):
    """
    Calculates the Intersection over Union (IoU) between a box and k clusters.
    :param box: tuple or array, shifted to the origin (i. e. width and height)
    :param clusters: numpy array of shape (k, 2) where k is the number of clusters
    :return: numpy array of shape (k,) where k is the number of clusters
    """
    x = np.minimum(clusters[:, 0], box[0])
    y = np.minimum(clusters[:, 1], box[1])
    if np.count_nonzero(x == 0) > 0 or np.count_nonzero(y == 0) > 0:
        raise ValueError("Box has no area")
    intersection = x * y
    box_area = box[0] * box[1]
    cluster_area = clusters[:, 0] * clusters[:, 1]
    iou_ = intersection / (box_area + cluster_area - intersection)
    return iou_

def kmeans(boxes, k, dist=np.median):
    """
    Calculates k-means clustering with the Intersection over Union (IoU) metric.
    :param boxes: numpy array of shape (r, 2), where r is the number of rows
    :param k: number of clusters
    :param dist: function that aggregates the boxes of a cluster into its new center (e.g. np.median or np.mean)
    :return: numpy array of shape (k, 2)
    """
    rows = boxes.shape[0]

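    # distance matrix: one row per sample, one column per cluster center
    distances = np.empty((rows, k))
    # cluster index of each sample in the previous iteration; -1 never matches a real
    # index, so the loop cannot stop before the first center update
    last_clusters = np.full((rows,), -1)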

    np.random.seed()

    # the Forgy method will fail if the whole array contains the same rows
    clusters = boxes[np.random.choice(rows, k, replace=False)]  # pick k distinct samples as the initial centers

    while True:
        for row in range(rows):
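            distances[row] = 1 - iou(boxes[row], clusters)  # distance d = 1 - IoU, the metric used to choose YOLOv3 anchor boxes
        nearest_clusters = np.argmin(distances, axis=1)  # index of the closest cluster for each sample
        if (last_clusters == nearest_clusters).all():  # stop once no assignment has changed
            break
        for cluster in range(k):  # update the cluster centers
            # aggregate the boxes of this cluster with dist (median by default; pass np.mean for classic k-means)
            clusters[cluster] = dist(boxes[nearest_clusters == cluster], axis=0)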
        last_clusters = nearest_clusters
    return clusters
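
One way to judge the resulting anchors is the mean IoU between each box and its best-matching cluster. The sketch below is self-contained and vectorised. The helper names `pairwise_iou` and `avg_iou`, as well as the sample boxes and anchors, are assumptions made up for illustration and are not part of the listing above.

```python
import numpy as np


def pairwise_iou(boxes, clusters):
    """IoU between every (width, height) box and every cluster, shape (r, k)."""
    boxes = np.asarray(boxes, dtype=float)
    clusters = np.asarray(clusters, dtype=float)
    # Boxes are aligned at the origin, so the overlap is min(width) * min(height).
    inter_w = np.minimum(boxes[:, None, 0], clusters[None, :, 0])
    inter_h = np.minimum(boxes[:, None, 1], clusters[None, :, 1])
    intersection = inter_w * inter_h
    box_area = (boxes[:, 0] * boxes[:, 1])[:, None]
    cluster_area = (clusters[:, 0] * clusters[:, 1])[None, :]
    return intersection / (box_area + cluster_area - intersection)


def avg_iou(boxes, clusters):
    """Mean IoU between each box and its best-matching cluster."""
    return float(np.mean(np.max(pairwise_iou(boxes, clusters), axis=1)))


# Hypothetical data: widths and heights normalised to [0, 1].
boxes = np.array([[0.10, 0.20], [0.12, 0.22], [0.50, 0.40], [0.52, 0.38]])
anchors = np.array([[0.11, 0.21], [0.51, 0.39]])
score = avg_iou(boxes, anchors)
```

A score closer to 1 means the anchors fit the boxes more tightly. Comparing the score for several values of k is a common way to decide how many anchors to keep.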