问题描述
我在做什么用k近邻算法在Matlab中的数据分析。我的数据包括约11795 x 88数据矩阵,其中的行的意见和列变量。
我的任务是找到k最近邻居N的测试点。目前,我做它与以下逻辑:
换句话说,我环路所有N个测试点。对于每一个测试点我搜索的数据(不包括测试点本身)对于K最近的邻居欧氏距离。对于每一个测试点这大约需要KX 11794迭代。所以整个过程大约需要nxkx 11794迭代。如果n = 10000和k = 7,这将是约825,6百万次迭代。
有没有计算第k近邻更有效的方法?大部分的计算是现在要浪费,因为我的算法简单:
计算的欧几里得距离的所有其他点,拾取最接近,并排除在进一步的考虑最近点 - >计算的欧几里得距离的所有其他点和拾取最靠近 - >等 - >等等。
有一个聪明的办法来摆脱这种浪费计算?
的目前这个过程大约需要7个小时我的电脑(3.2千兆赫,8 GB内存,64位的Win 7)...:(
下面是一些明确说明的逻辑(这不是我的所有code,但是这是吃了性能的部分):
对于i = 1:尺寸(测试点,1)%循环中的所有测试点
neighborcandidates = all_data_excluding_testpoints; %使用其他数据不包括测试点寻找第k近邻
测试点测试点=(I,:); %这是一个测试点,我们发现K-近邻
kneighbors = []; %在这里存放k最近邻居。
对于j = 1:K%得到k近邻
bdist =天道酬勤; %最接近的邻居的距离
绑定= 0; %在最近的邻居的索引
对于n = 1:尺寸(neighborcandidates,1)%回路的所有候选人
如果pdist([测试点; neighborcandidates(N,:)])< bdist%检查的欧氏距离
bdist = pdist([测试点; neighborcandidates(N,:)]); %更新最佳距离,到目前为止
绑定= N; %保存最好的发现指数至今
结束
结束
kneighbors = [kneighbors; neighborcandidates(绑定,:)]。 %保存发现邻居
neighborcandidates(绑定,:) = []; %拆下进一步考虑邻居
结束
结束
使用 pdist2
:
A =兰特(20,5); %//这是你的11795 x 88
B = A([1,12,4,8],:); %//这是您的正由-88子集,即n = 4时在此情况下
N =尺寸(B,1);
D = pdist2(A,B);
[〜,IND] =排序(D);
kneighbours = IND(2:2 + K,:);
现在,你可以在 A
使用 kneighbours
来索引行。需要注意的是 kneighbours
的列对应于 B
但既然你已经浸入统计工具箱, pdist
为什么不直接使用Matlab的 knnsearch
?
kneighbours_matlab = knnsearch(A,B,'K',K + 1);
请注意, kneighbours
是一样的 kneighbours_matlab(:,2:结束)
I'm doing data analysis using k-nearest neighbor algorithm in Matlab. My data consists of about 11795 x 88 data matrix, where the rows are observations and columns are variables.
My task is to find k-nearest neighbors for n selected test points. Currently I'm doing it with the following logic:
In other words, I loop all the n test points. For each test point I search the data (which excludes the test point itself) for k-nearest neighbors by euclidean distance. For each test point this takes approximately k x 11794 iterations. So the whole process takes about n x k x 11794 iterations. If n = 10000 and k = 7, this would be approximately 825,6 million iterations.
Is there a more efficient way to calculate the k-nearest neighbors? Most of the computation is going to waste now, because my algorithm simply:
calculates the euclidean distance to all the other points, picks up the closest and excludes the closest point from further consideration --> calculates the euclidean distance to all the other points and picks up the closest --> etc. --> etc.
Is there a smart way to get rid of this 'waste calculation'?
Currently this process takes about 7 hours in my computer (3.2 GHz, 8 GB RAM, 64-bit Win 7)... :(
Here is some of the logic illustrated explicitly (this is not all my code, but this is the part that eats up performance):
for i = 1:size(testpoints, 1) % Loop all the test points
neighborcandidates = all_data_excluding_testpoints; % Use the rest of the data excluding the test points in search of the k-nearest neighbors
testpoint = testpoints(i, :); % This is the test point for which we find k-nearest neighbors
kneighbors = []; % Store the k-nearest neighbors here.
for j = 1:k % Find k-nearest neighbors
bdist = Inf; % The distance of the closest neighbor
bind = 0; % The index of the closest neighbor
for n = 1:size(neighborcandidates, 1) % Loop all the candidates
if pdist([testpoint; neighborcandidates(n, :)]) < bdist % Check the euclidean distance
bdist = pdist([testpoint; neighborcandidates(n, :)]); % Update the best distance so far
bind = n; % Save the best found index so far
end
end
kneighbors = [kneighbors; neighborcandidates(bind, :)]; % Save the found neighbour
neighborcandidates(bind, :) = []; % Remove the neighbor from further consideration
end
end
Using pdist2
:
A = rand(20,5); %// This is your 11795 x 88
B = A([1, 12, 4, 8], :); %// This is your n-by-88 subset, i.e. n=4 in this case
n = size(B,1);
D = pdist2(A,B);
[~, ind] = sort(D);
kneighbours = ind(2:2+k, :);
Now you can use kneighbours
to index a row in A
. Note that the columns of kneighbours
correspond to the rows of B
But since you're already dipping into the stats toolbox with pdist
why not just use Matlab's knnsearch
?
kneighbours_matlab = knnsearch(A,B,'K',k+1);
note that kneighbours
is the same as kneighbours_matlab(:,2:end)'
这篇关于如何进行有效的k近邻计算在Matlab的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!