Problem description
Given a sparse matrix listing, what's the best way to calculate the cosine similarity between each pair of columns (or rows) in the matrix? I would rather not iterate n-choose-two times.
Say the input matrix is:
A=
[0 1 0 0 1
0 0 1 1 1
1 1 0 1 0]
The sparse representation, as (row, column) pairs of the nonzero entries, is:
A =
0, 1
0, 4
1, 2
1, 3
1, 4
2, 0
2, 1
2, 3
In Python, it's straightforward to work with the matrix-input format:
import numpy as np
from sklearn.metrics import pairwise_distances
A = np.array(
[[0, 1, 0, 0, 1],
[0, 0, 1, 1, 1],
[1, 1, 0, 1, 0]])
dist_out = 1-pairwise_distances(A, metric="cosine")
dist_out
Gives:
array([[ 1. , 0.40824829, 0.40824829],
[ 0.40824829, 1. , 0.33333333],
[ 0.40824829, 0.33333333, 1. ]])
That's fine for a full-matrix input, but I really want to start with the sparse representation (due to the size and sparsity of my matrix). Any ideas about how this could best be accomplished? Thanks in advance.
Recommended answer
You can compute pairwise cosine similarity on the rows of a sparse matrix directly using sklearn. As of version 0.17 it also supports sparse output:
import numpy as np
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

A = np.array([[0, 1, 0, 0, 1], [0, 0, 1, 1, 1], [1, 1, 0, 1, 0]])
A_sparse = sparse.csr_matrix(A)

# dense (numpy array) output by default
similarities = cosine_similarity(A_sparse)
print('pairwise dense output:\n {}\n'.format(similarities))

# can also output sparse matrices
similarities_sparse = cosine_similarity(A_sparse, dense_output=False)
print('pairwise sparse output:\n {}\n'.format(similarities_sparse))
Results:
pairwise dense output:
[[ 1. 0.40824829 0.40824829]
[ 0.40824829 1. 0.33333333]
[ 0.40824829 0.33333333 1. ]]
pairwise sparse output:
(0, 1) 0.408248290464
(0, 2) 0.408248290464
(0, 0) 1.0
(1, 0) 0.408248290464
(1, 2) 0.333333333333
(1, 1) 1.0
(2, 1) 0.333333333333
(2, 0) 0.408248290464
(2, 2) 1.0
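If the data only exists as the (row, column) listing from the question, the sparse matrix can probably be built without going through a dense array first. The following is a minimal sketch, assuming every listed entry has value 1 and that the shape (3, 5) is known in advance; the names `rows`, `cols`, and `vals` are illustrative and not part of the original post.

```python
import numpy as np
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

# (row, column) pairs of the nonzero entries, copied from the question's listing
rows = np.array([0, 0, 1, 1, 1, 2, 2, 2])
cols = np.array([1, 4, 2, 3, 4, 0, 1, 3])
vals = np.ones(len(rows))  # assumption: every listed entry has value 1

# Build in COO (coordinate) format, then convert to CSR for row-wise work
A_sparse = sparse.coo_matrix((vals, (rows, cols)), shape=(3, 5)).tocsr()

# Sparse in, sparse out: the dense 3x5 array is never created
similarities = cosine_similarity(A_sparse, dense_output=False)
print(similarities.toarray())
```

For a large matrix, skip the final `toarray()` call and work with the sparse result directly, since the dense similarity matrix has one entry per pair of rows.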
If you want column-wise cosine similarities, simply transpose your input matrix beforehand:
A_sparse.transpose()
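Note that `transpose()` returns a new matrix rather than modifying `A_sparse` in place, so its result has to be passed on or assigned. A short sketch of the full column-wise call, reusing the example matrix from above (the name `col_similarities` is illustrative):

```python
import numpy as np
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

A = np.array([[0, 1, 0, 0, 1], [0, 0, 1, 1, 1], [1, 1, 0, 1, 0]])
A_sparse = sparse.csr_matrix(A)

# Rows of the transposed matrix are the columns of A, so this compares columns
col_similarities = cosine_similarity(A_sparse.transpose(), dense_output=False)
print(col_similarities.shape)  # (5, 5): one row and column per column of A
```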