This article covers how to use a KD-tree with a custom distance metric.

Question

I want to use KDTree with a custom distance metric (it is the best option; other KNN algorithms aren't optimal for my project). I checked some answers to similar questions here, and this should work, but it doesn't.

distance_matrix is symmetric, as it should be by definition:

array([[ 1.,  0.,  5.,  5.,  0.,  3.,  2.],
       [ 0.,  1.,  0.,  0.,  0.,  0.,  0.],
       [ 5.,  0.,  1.,  5.,  0.,  2.,  3.],
       [ 5.,  0.,  5.,  1.,  0.,  4.,  4.],
       [ 0.,  0.,  0.,  0.,  1.,  0.,  0.],
       [ 3.,  0.,  2.,  4.,  0.,  1.,  0.],
       [ 2.,  0.,  3.,  4.,  0.,  0.,  1.]])

I know my metric is not formally a metric, but the documentation (under "User-defined distance") says the function has to be a formal metric only when using a ball tree. Here is my code:

from sklearn.neighbors import DistanceMetric
def dist(x, y):
    dist = 0
    for elt_x, elt_y in zip(x, y):
        dist += distance_matrix[elt_x, elt_y]
    return dist
X = np.array([[1,0], [1,2], [1,3]])
tree = KDtree(X, metric=dist)

I get this error:

NameError
Traceback (most recent call last)
<ipython-input-27-b5fac7810091> in <module>()
  7     return dist
  8 X = np.array([[1,0], [1,2], [1,3]])
----> 9 tree = KDtree(X, metric=dist)
NameError: name 'KDtree' is not defined

I also tried:

from sklearn.neighbors import KDTree
def dist(x, y):
    dist = 0
    for elt_x, elt_y in zip(x, y):
        dist += distance_matrix[elt_x, elt_y]
    return dist
X = np.array([[1,0], [1,2], [1,3]])
tree = KDTree(X, metric=lambda a,b: dist(a,b))

I get this error:

ValueError
Traceback (most recent call last)
<ipython-input-27-b5fac7810091> in <module>()
  7     return dist
  8 X = np.array([[1,0], [1,2], [1,3]])
----> 9 tree = KDtree(X, metric=dist)
ValueError: metric PyFuncDistance is not valid for KDTree

I also tried:

from sklearn.neighbors import NearestNeighbors
nbrs = NearestNeighbors(n_neighbors=1, algorithm='kd_tree', metric=dist_metric)

I get the following error:

ValueError                                Traceback (most recent call last)
<ipython-input-32-c78d02cacb5a> in <module>()
      1 from sklearn.neighbors import NearestNeighbors
----> 2 nbrs = NearestNeighbors(n_neighbors=1, algorithm='kd_tree', metric=dist_metric)

/usr/local/lib/python3.5/dist-packages/sklearn/neighbors/unsupervised.py in __init__(self, n_neighbors, radius, algorithm, leaf_size, metric, p, metric_params, n_jobs, **kwargs)
    121                           algorithm=algorithm,
    122                           leaf_size=leaf_size, metric=metric, p=p,
--> 123                           metric_params=metric_params, n_jobs=n_jobs, **kwargs)

/usr/local/lib/python3.5/dist-packages/sklearn/neighbors/base.py in _init_params(self, n_neighbors, radius, algorithm, leaf_size, metric, p, metric_params, n_jobs)
    138                 raise ValueError(
    139                     "kd_tree algorithm does not support callable metric '%s'"
--> 140                     % metric)
    141         elif metric not in VALID_METRICS[alg_check]:
    142             raise ValueError("Metric '%s' not valid for algorithm '%s'"

ValueError: kd_tree algorithm does not support callable metric '<function dist_metric at 0x7f58c2b3fd08>'

I tried all the other algorithms (auto, brute, ...), but they raise the same error.

I have to use a distance matrix for the elements of the vectors, because each element is a code for a characteristic, and 5 can be closer to 1 than 3 is. What I need is the top 3 neighbors, sorted from closest to furthest.

Recommended answer

Scikit-learn's KDTree does not support custom distance metrics. BallTree does support them, but be careful: it is up to the user to make sure the supplied function is actually a valid metric. If it is not, the algorithm will happily return query results, but they will be incorrect.
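As a rough illustration of that advice, the sketch below first checks a per-code distance table against the metric axioms and only then passes a callable to BallTree. The table `code_dist`, the helper `is_metric_table`, the function `code_metric`, and the sample points are all hypothetical names and values made up for this example; they are not from the question. Note that the question's own distance_matrix would fail this check, since its diagonal is 1 and several off-diagonal entries are 0. The callable casts its inputs to int because BallTree hands it float arrays.

```python
import numpy as np
from sklearn.neighbors import BallTree


def is_metric_table(table):
    """Return True if a square per-code distance table satisfies the metric axioms."""
    t = np.asarray(table, dtype=float)
    off_diag = ~np.eye(len(t), dtype=bool)
    return bool(
        np.all(np.diag(t) == 0)          # d(a, a) == 0
        and np.all(t[off_diag] > 0)      # d(a, b) > 0 for a != b
        and np.allclose(t, t.T)          # symmetry
        # triangle inequality: t[i, j] <= t[i, k] + t[k, j] for all i, j, k
        and np.all(t[:, None, :] <= t[:, :, None] + t[None, :, :])
    )


# Hypothetical table for three feature codes (an assumption for illustration).
code_dist = np.array([[0., 2., 1.],
                      [2., 0., 2.],
                      [1., 2., 0.]])


def code_metric(x, y):
    # BallTree passes float arrays, so cast back to integer codes before indexing.
    return code_dist[x.astype(int), y.astype(int)].sum()


X = np.array([[1, 0], [1, 2], [0, 1], [2, 2]])

if is_metric_table(code_dist):
    tree = BallTree(X, metric=code_metric)
    dists, idx = tree.query(X[:1], k=3)  # 3 nearest neighbors of the first row
    print(idx, dists)
```

A sum of per-coordinate metrics is itself a metric, which is why checking the table is enough here. The check says nothing about speed: every distance evaluation still goes through a Python call.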

Also, be aware that a custom Python function used as a metric is generally too slow to be useful, because of the overhead of Python callbacks during tree traversal.
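One way around both the callback overhead and the metric requirement, when the data set is small enough for an all-pairs table to fit in memory, is a vectorized brute-force search. The sketch below is an alternative technique, not something the answer proposes: it uses NumPy fancy indexing to look up every per-element distance at once, with the question's distance_matrix and X, and treats the matrix entries as distances exactly as the question's dist function does. The helper name `pairwise_code_distance` is made up for this example.

```python
import numpy as np

# The distance matrix and sample points from the question.
distance_matrix = np.array([[1., 0., 5., 5., 0., 3., 2.],
                            [0., 1., 0., 0., 0., 0., 0.],
                            [5., 0., 1., 5., 0., 2., 3.],
                            [5., 0., 5., 1., 0., 4., 4.],
                            [0., 0., 0., 0., 1., 0., 0.],
                            [3., 0., 2., 4., 0., 1., 0.],
                            [2., 0., 3., 4., 0., 0., 1.]])
X = np.array([[1, 0], [1, 2], [1, 3]])


def pairwise_code_distance(A, B, table):
    """Sum table[a_i, b_i] over features for every pair of rows in A and B."""
    # Indexing broadcasts to shape (len(A), len(B), n_features).
    return table[A[:, None, :], B[None, :, :]].sum(axis=-1)


D = pairwise_code_distance(X, X, distance_matrix)

k = 3
# Stable sort keeps the original order among ties; columns run closest to furthest.
nearest = np.argsort(D, axis=1, kind="stable")[:, :k]
print(D)
print(nearest)
```

Because this compares every pair exhaustively, the result does not depend on the triangle inequality, at the cost of O(n²) time and memory. The same matrix D could also be handed to NearestNeighbors with metric='precomputed' if the scikit-learn interface is preferred.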
