Difference between scipy pairwise distance and X.X + Y.Y - 2*X.Y^t

Problem description

Suppose we have the following data:

import numpy as np

d1 = np.random.uniform(low=0, high=2, size=(3,2))
d2 = np.random.uniform(low=3, high=5, size=(3,2))
X = np.vstack((d1,d2))

X
array([[ 1.4930674 ,  1.64890721],
       [ 0.40456265,  0.62262546],
       [ 0.86893397,  1.3590808 ],
       [ 4.04177045,  4.40938126],
       [ 3.01396153,  4.60005842],
       [ 3.2144552 ,  4.65539323]])

I want to compare two methods for generating the pairwise distances.

Assume X and Y are the same:

(X-Y)^2 = X.X + Y.Y - 2*X.Y^t

Here is the first method, as used in scikit-learn to compute pairwise distances and, later, the kernel matrix.

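import numpy as np

def cal_pdist1(X):
    # Squared Euclidean distances via x.x + y.y - 2*x.y, with Y = X
    Y = X
    XX = np.einsum('ij,ij->i', X, X)[np.newaxis, :]  # row-wise squared norms, shape (1, n)
    YY = XX.T                                        # the same norms as a column, shape (n, 1)
    distances = -2 * np.dot(X, Y.T)
    distances += XX
    distances += YY
    return distances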

cal_pdist1(X)
array([[  0.        ,   2.2380968 ,   0.47354188,  14.11610424,
         11.02241244,  12.00213414],
       [  2.2380968 ,   0.        ,   0.75800718,  27.56880003,
         22.62893544,  24.15871196],
       [  0.47354188,   0.75800718,   0.        ,  19.37122424,
         15.1050792 ,  16.36714548],
       [ 14.11610424,  27.56880003,  19.37122424,   0.        ,
          1.09274896,   0.74497242],
       [ 11.02241244,  22.62893544,  15.1050792 ,   1.09274896,
          0.        ,   0.04325965],
       [ 12.00213414,  24.15871196,  16.36714548,   0.74497242,
          0.04325965,   0.        ]])

Now, if I use the scipy pairwise distance function as below, I get

from scipy.spatial.distance import pdist, squareform

# pdist returns a condensed 1-D distance vector; squareform expands it to a full matrix
pd_condensed = pdist(X, metric='seuclidean')
squareform(pd_condensed)
array([[ 0.        ,  0.92916653,  0.45646989,  2.29444795,  1.89740167,
         2.00059442],
       [ 0.92916653,  0.        ,  0.50798432,  3.22211357,  2.78788236,
         2.90062103],
       [ 0.45646989,  0.50798432,  0.        ,  2.72720831,  2.28001564,
         2.39338343],
       [ 2.29444795,  3.22211357,  2.72720831,  0.        ,  0.71411943,
         0.58399694],
       [ 1.89740167,  2.78788236,  2.28001564,  0.71411943,  0.        ,
         0.14102567],
       [ 2.00059442,  2.90062103,  2.39338343,  0.58399694,  0.14102567,
         0.        ]])

The results are completely different! Shouldn't they be the same?

Recommended answer

pdist(..., metric='seuclidean') computes the standardized Euclidean distance, not the squared Euclidean distance (which is what cal_pdist1 returns).

From the documentation:

Computes the standardized Euclidean distance. The standardized Euclidean distance between two n-vectors u and v is

$$\sqrt{\sum_i \frac{(u_i - v_i)^2}{V[x_i]}}$$

V is the variance vector; V[i] is the variance computed over all the i'th components of the points. If V is not passed, it is computed automatically.
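To make the standardized distance concrete, here is a minimal sketch that reproduces it by hand. It assumes that, when V is not passed, pdist estimates it as the per-column sample variance (ddof=1); that default and the random seed are assumptions of this example, not something stated in the question.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Seed chosen only so the example is reproducible (not from the question)
rng = np.random.default_rng(0)
X = np.vstack((rng.uniform(0, 2, size=(3, 2)),
               rng.uniform(3, 5, size=(3, 2))))

# Assumption: the default V is the per-column sample variance (ddof=1)
V = np.var(X, axis=0, ddof=1)

# Dividing each column by its standard deviation and then taking the plain
# Euclidean distance is the same as applying the standardized formula
manual = squareform(pdist(X / np.sqrt(V), metric='euclidean'))
explicit = squareform(pdist(X, metric='seuclidean', V=V))
builtin = squareform(pdist(X, metric='seuclidean'))

print(np.allclose(manual, explicit))
print(np.allclose(builtin, explicit))
```

Because every coordinate is rescaled by its spread, these values are not comparable to raw squared distances, which is why the two matrices in the question differ.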

Try passing metric='sqeuclidean', and you will see that both functions return the same result to within rounding error.
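The check can be sketched as follows. The function body is the one from the question; the seeded random data is an assumption made for reproducibility, so the numbers will not match the arrays printed above.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def cal_pdist1(X):
    # Squared Euclidean distances via x.x + y.y - 2*x.y, with Y = X
    XX = np.einsum('ij,ij->i', X, X)[np.newaxis, :]
    distances = -2 * np.dot(X, X.T)
    distances += XX
    distances += XX.T
    return distances

# Seed chosen only so the example is reproducible (not from the question)
rng = np.random.default_rng(0)
X = np.vstack((rng.uniform(0, 2, size=(3, 2)),
               rng.uniform(3, 5, size=(3, 2))))

expanded = cal_pdist1(X)
reference = squareform(pdist(X, metric='sqeuclidean'))

# Compare with a tolerance: the expanded form can be off by round-off,
# including tiny nonzero (possibly negative) values on the diagonal
print(np.allclose(expanded, reference))
```

If ordinary (unsquared) distances are needed from the expanded form, it is safer to clip small negative round-off values to zero before taking the square root.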

