如何进行有效的k近邻计算在Matlab

本文介绍了如何进行有效的k近邻计算在Matlab的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我在做什么用k近邻算法在Matlab中的数据分析。我的数据包括约11795 x 88数据矩阵，其中的行的意见和列变量。

我的任务是找到k最近邻居N的测试点。目前，我做它与以下逻辑：

换句话说，我环路所有N个测试点。对于每一个测试点我搜索的数据（不包括测试点本身）对于K最近的邻居欧氏距离。对于每一个测试点这大约需要KX 11794迭代。所以整个过程大约需要nxkx 11794迭代。如果n = 10000和k = 7，这将是约825,6百万次迭代。

有没有计算第k近邻更有效的方法？大部分的计算是现在要浪费，因为我的算法简单：

计算的欧几里得距离的所有其他点，拾取最接近，并排除在进一步的考虑最近点 - >计算的欧几里得距离的所有其他点和拾取最靠近 - >等 - >等等。

有一个聪明的办法来摆脱这种浪费计算？

的

目前这个过程大约需要7个小时我的电脑（3.2千兆赫，8 GB内存，64位的Win 7）...：（

下面是一些明确说明的逻辑（这不是我的所有code，但是这是吃了性能的部分）：

 对于i = 1：尺寸（测试点，1）％循环中的所有测试点
    neighborcandidates = all_data_excluding_testpoints; ％使用其他数据不包括测试点寻找第k近邻
    测试点测试点=（I，:); ％这是一个测试点，我们发现K-近邻
    kneighbors = []; ％在这里存放k最近邻居。
    对于j = 1：K％得到k近邻
        bdist =天道酬勤; ％最接近的邻居的距离
        绑定= 0; ％在最近的邻居的索引
        对于n = 1：尺寸（neighborcandidates，1）％回路的所有候选人
            如果pdist（[测试点; neighborcandidates（N，:)]）＆LT; bdist％检查的欧氏距离
                bdist = pdist（[测试点; neighborcandidates（N，:)]）; ％更新最佳距离，到目前为止
                绑定= N; ％保存最好的发现指数至今
            结束
        结束
        kneighbors = [kneighbors; neighborcandidates（绑定，:)]。 ％保存发现邻居
        neighborcandidates（绑定，:) = []; ％拆下进一步考虑邻居
    结束
结束

解决方案

使用 pdist2 ：

  A =兰特（20,5）; ％//这是你的11795 x 88
B = A（[1，12，4,8]，:); ％//这是您的正由-88子集，即n = 4时在此情况下
N =尺寸（B，1）;

D = pdist2（A，B）;
[〜，IND] =排序（D）;
kneighbours = IND（2：2 + K，:);

现在，你可以在 A 使用 kneighbours 来索引行。需要注意的是 kneighbours 的列对应于 B

的行

但既然你已经浸入统计工具箱， pdist 为什么不直接使用Matlab的 knnsearch ？

  kneighbours_matlab = knnsearch（A，B，'K'，K + 1）;

请注意， kneighbours 是一样的 kneighbours_matlab（：，2：结束）

I'm doing data analysis using k-nearest neighbor algorithm in Matlab. My data consists of about 11795 x 88 data matrix, where the rows are observations and columns are variables.

My task is to find k-nearest neighbors for n selected test points. Currently I'm doing it with the following logic:

In other words, I loop all the n test points. For each test point I search the data (which excludes the test point itself) for k-nearest neighbors by euclidean distance. For each test point this takes approximately k x 11794 iterations. So the whole process takes about n x k x 11794 iterations. If n = 10000 and k = 7, this would be approximately 825,6 million iterations.

Is there a more efficient way to calculate the k-nearest neighbors? Most of the computation is going to waste now, because my algorithm simply:

calculates the euclidean distance to all the other points, picks up the closest and excludes the closest point from further consideration --> calculates the euclidean distance to all the other points and picks up the closest --> etc. --> etc.

Is there a smart way to get rid of this 'waste calculation'?

Currently this process takes about 7 hours in my computer (3.2 GHz, 8 GB RAM, 64-bit Win 7)... :(

Here is some of the logic illustrated explicitly (this is not all my code, but this is the part that eats up performance):

for i = 1:size(testpoints, 1) % Loop all the test points
    neighborcandidates = all_data_excluding_testpoints; % Use the rest of the data excluding the test points in search of the k-nearest neighbors
    testpoint = testpoints(i, :); % This is the test point for which we find k-nearest neighbors
    kneighbors = []; % Store the k-nearest neighbors here.
    for j = 1:k % Find k-nearest neighbors
        bdist = Inf; % The distance of the closest neighbor
        bind = 0; % The index of the closest neighbor
        for n = 1:size(neighborcandidates, 1) % Loop all the candidates
            if pdist([testpoint; neighborcandidates(n, :)]) < bdist % Check the euclidean distance
                bdist = pdist([testpoint; neighborcandidates(n, :)]); % Update the best distance so far
                bind = n; % Save the best found index so far
            end
        end
        kneighbors = [kneighbors; neighborcandidates(bind, :)]; % Save the found neighbour
        neighborcandidates(bind, :) = []; % Remove the neighbor from further consideration
    end
end

解决方案

Using pdist2:

A = rand(20,5);             %// This is your 11795 x 88
B = A([1, 12, 4, 8], :);    %// This is your n-by-88 subset, i.e. n=4 in this case
n = size(B,1);

D = pdist2(A,B);
[~, ind] = sort(D);
kneighbours = ind(2:2+k, :);

Now you can use kneighbours to index a row in A. Note that the columns of kneighbours correspond to the rows of B

But since you're already dipping into the stats toolbox with pdist why not just use Matlab's knnsearch?

kneighbours_matlab = knnsearch(A,B,'K',k+1);

note that kneighbours is the same as kneighbours_matlab(:,2:end)'

这篇关于如何进行有效的k近邻计算在Matlab的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持！