Euclidean distance (python3, sklearn): efficiently compute the closest pairs and their corresponding distances

Question

I'm given a 2-D numpy array X of floating-point values and need to compute the Euclidean distances between all pairs of rows, then find the k pairs of row indices with the smallest distances and return them (where k > 0). I'm testing with a small array and this is what I have so far...

import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

X_testing = np.asarray([[1,2,3.5],[4,1,2],[0,0,2],[3.4,1,5.6]])
test = euclidean_distances(X_testing, X_testing)
print(test)

The result is:

[[ 0.          3.5         2.6925824   3.34215499]
 [ 3.5         0.          4.12310563  3.64965752]
 [ 2.6925824   4.12310563  0.          5.05173238]
 [ 3.34215499  3.64965752  5.05173238  0.        ]]

Next, I need to efficiently compute the top k smallest distances between all pairs of rows, and return the corresponding k tuples of (row1, row2, distance_value) in order in the form of a list.

So in the above test case, if k = 2, then I would need to return the following:

[(0, 2, 2.6925824), (0, 3, 3.34215499)]

Is there a built-in way (in scipy, sklearn, numpy, etc.), or any other way, to compute this efficiently? Although the above test case is small, in reality the 2-D array is very large, so memory and time are both concerns. Thanks

Recommended answer

Using scipy.spatial instead of sklearn (which I haven't installed yet) I can get the same distance matrix:

In [623]: from scipy import spatial
In [624]: pdist=spatial.distance.pdist(X_testing)
In [625]: pdist
Out[625]:
array([ 3.5       ,  2.6925824 ,  3.34215499,  4.12310563,  3.64965752,
        5.05173238])
In [626]: D=spatial.distance.squareform(pdist)
In [627]: D
Out[627]:
array([[ 0.        ,  3.5       ,  2.6925824 ,  3.34215499],
       [ 3.5       ,  0.        ,  4.12310563,  3.64965752],
       [ 2.6925824 ,  4.12310563,  0.        ,  5.05173238],
       [ 3.34215499,  3.64965752,  5.05173238,  0.        ]])

pdist is in condensed form, whose indices in the squareform can be found with

In [629]: np.triu_indices(4,1)
Out[629]:
(array([0, 0, 0, 1, 1, 2], dtype=int32),
 array([1, 2, 3, 2, 3, 3], dtype=int32))
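
As a quick check of that correspondence, here is a minimal sketch (the names `condensed`, `square`, `rows` and `cols` are only illustrative) confirming that entry m of the condensed vector is the distance between rows `rows[m]` and `cols[m]`:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

X_testing = np.asarray([[1, 2, 3.5], [4, 1, 2], [0, 0, 2], [3.4, 1, 5.6]])
n = X_testing.shape[0]

condensed = pdist(X_testing)      # length n*(n-1)/2, upper triangle only
square = squareform(condensed)    # full n x n symmetric matrix

# Row and column index of every entry above the diagonal, in row-major order.
rows, cols = np.triu_indices(n, k=1)

# The condensed vector lists exactly those upper-triangle entries, in that order.
print(np.allclose(condensed, square[rows, cols]))
print(condensed.size, square.size)  # n*(n-1)/2 entries versus n*n entries
```

Since the square matrix is symmetric with a zero diagonal, the condensed vector already holds every distinct distance, so the squareform call can be skipped when memory matters.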

The 2 smallest distances are located at the first 2 indices returned by argsort:

In [630]: idx=np.argsort(pdist)
In [631]: idx
Out[631]: array([1, 2, 0, 4, 3, 5], dtype=int32)
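
A full argsort orders every pair, which is more work than needed when only k of them are wanted. One possible alternative, not part of the original answer, is `np.argpartition`, which selects the k smallest entries without sorting the rest; only those k are then sorted. A minimal sketch, assuming 0 < k <= number of pairs:

```python
import numpy as np

# Condensed distances from the example above.
d = np.array([3.5, 2.6925824, 3.34215499, 4.12310563, 3.64965752, 5.05173238])
k = 2

# Indices of the k smallest distances, in no particular order.
part = np.argpartition(d, k - 1)[:k]

# Sort just those k indices by their distance.
smallest = part[np.argsort(d[part])]
print(smallest)  # [1 2]
```

These are the same two indices as `idx[:2]` from the full argsort.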

So we want elements [1,2] of pdist and the corresponding entries of the triu indices:

In [633]: pdist[idx[:2]]
Out[633]: array([ 2.6925824 ,  3.34215499])
In [634]: np.transpose(np.triu_indices(4,1))[idx[:2],:]
Out[634]:
array([[0, 2],
       [0, 3]], dtype=int32)

and to collect those values as a list of tuples:

In [636]: I,J = np.triu_indices(4,1)
In [637]: kbig = idx[:2]
In [638]: [(i,j,d) for i,j,d in zip(I[kbig], J[kbig], pdist[kbig])]
Out[638]: [(0, 2, 2.6925824035672519), (0, 3, 3.3421549934136805)]
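
The steps above can be wrapped into one function. This is only a sketch: the name `k_closest_pairs` is hypothetical, and it swaps the full argsort for `np.argpartition`, which is an assumption about what "efficient" should mean here rather than something the answer itself does.

```python
import numpy as np
from scipy.spatial.distance import pdist


def k_closest_pairs(X, k):
    """Return the k closest row pairs of X as (row1, row2, distance) tuples,
    ordered by increasing distance. Hypothetical helper for illustration."""
    X = np.asarray(X, dtype=float)
    d = pdist(X)                      # condensed Euclidean distances
    k = min(k, d.size)                # cannot return more pairs than exist
    if k <= 0:
        return []
    top = np.argpartition(d, k - 1)[:k]   # k smallest, unordered
    top = top[np.argsort(d[top])]         # order those k by distance
    rows, cols = np.triu_indices(X.shape[0], k=1)
    # Convert numpy scalars to plain Python ints and floats.
    return [(int(i), int(j), float(v))
            for i, j, v in zip(rows[top], cols[top], d[top])]


X_testing = np.asarray([[1, 2, 3.5], [4, 1, 2], [0, 0, 2], [3.4, 1, 5.6]])
result = k_closest_pairs(X_testing, 2)
print(result)
```

For the test array this returns the pairs (0, 2) and (0, 3), as in the output above. Note that pdist still stores all n*(n-1)/2 distances, so memory grows quadratically with the number of rows.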
