Problem description
I have a large (100K by 30K) and (very) sparse dataset in svmlight format which I load as follows:
import numpy as np
from scipy.cluster.vq import kmeans2
from scipy.spatial.distance import pdist, squareform
from sklearn.datasets import load_svmlight_file
X,Y = load_svmlight_file("somefile_svm.txt")
which returns X as a SciPy sparse matrix.
I simply need to compute the pairwise distances between all training points:
D = pdist(X)
Unfortunately, the distance functions in scipy.spatial.distance only work on dense arrays. Given the size of the dataset, it is infeasible to densify it first and call pdist, e.g.:
D = pdist(X.todense())
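For a rough sense of scale, the sketch below estimates the memory involved, assuming 8-byte (float64) values. The variable names are illustrative only. Note that the condensed output of pdist would itself be very large, regardless of how sparse the input is.

```python
# Back-of-the-envelope memory estimate, assuming float64 (8 bytes per value).
n_samples, n_features = 100_000, 30_000
bytes_per_value = 8

# Dense copy of the input, as produced by X.todense().
dense_bytes = n_samples * n_features * bytes_per_value

# pdist returns one distance per unordered pair of rows.
n_pairs = n_samples * (n_samples - 1) // 2
condensed_bytes = n_pairs * bytes_per_value

print(f"dense input:      {dense_bytes / 1e9:.1f} GB")
print(f"condensed output: {condensed_bytes / 1e9:.1f} GB")
```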
Any pointers to sparse matrix distance computation implementations or workarounds with regards to this problem will be greatly appreciated.
Many thanks.
Recommended answer
In scikit-learn there is a sklearn.metrics.euclidean_distances function that works for both sparse matrices and dense numpy arrays. See the reference documentation.
However, at the time of this answer, non-Euclidean distances were not yet implemented for sparse matrices.
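A minimal sketch of calling euclidean_distances on a sparse matrix, processed in row blocks so that the full n-by-n distance matrix is never held in memory at once. The small random matrix stands in for the svmlight data, and the helper name row_block_distances and the block size are illustrative assumptions, not part of the original answer.

```python
import numpy as np
from scipy import sparse
from sklearn.metrics import euclidean_distances

# Small random CSR matrix standing in for the real dataset (assumption).
X = sparse.random(200, 50, density=0.05, format="csr", random_state=0)


def row_block_distances(X, block_size=64):
    """Yield (start, stop, D) where D[i, j] is the Euclidean distance
    between row start + i of X and row j of X. (Hypothetical helper.)"""
    n = X.shape[0]
    for start in range(0, n, block_size):
        stop = min(start + block_size, n)
        # The sparse input is accepted directly; the result is a dense
        # array of shape (stop - start, n).
        yield start, stop, euclidean_distances(X[start:stop], X)


# Example reduction: index of the nearest other point for every row.
nearest = np.empty(X.shape[0], dtype=np.int64)
for start, stop, D in row_block_distances(X):
    # Exclude each point's distance to itself before taking the minimum.
    D[np.arange(stop - start), np.arange(start, stop)] = np.inf
    nearest[start:stop] = D.argmin(axis=1)
```

Each block costs block_size * n values, so the block size can be tuned to the available memory. Whether this is practical at 100K rows depends on what is done with each block, since the complete distance matrix is still too large to keep.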