Question
The code below causes my system to run out of memory before it completes.
Can you suggest a more efficient means of computing the cosine similarity on a large matrix, such as the one below?
I would like to have the cosine similarity computed for each of the 65000 rows in my original matrix (mat) relative to all of the others, so that the result is a 65000 x 65000 matrix where each element is the cosine similarity between two rows in the original matrix.
import numpy as np
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity
mat = np.random.rand(65000, 10)
sparse_mat = sparse.csr_matrix(mat)
similarities = cosine_similarity(sparse_mat)
After running that last line I always run out of memory and the program either freezes or crashes with a MemoryError. This occurs whether I run on my local machine with 8 GB of RAM or on a 64 GB EC2 instance.
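For scale, the requested output is enormous even before scikit-learn allocates any intermediate copies; a quick back-of-the-envelope check, assuming the default 8-byte float64 elements:

print(65000 * 65000 * 8 / 1e9)  # ~33.8 GB for the result matrix alone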
Answer
Same problem here. I've got a big, non-sparse matrix. It fits in memory just fine, but cosine_similarity crashes for whatever unknown reason, probably because they copy the matrix one time too many somewhere. So I made it compare small batches of rows "on the left" instead of the entire matrix:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def cosine_similarity_n_space(m1, m2, batch_size=100):
    assert m1.shape[1] == m2.shape[1]
    ret = np.ndarray((m1.shape[0], m2.shape[0]))
    for row_i in range(0, int(m1.shape[0] / batch_size) + 1):
        start = row_i * batch_size
        end = min([(row_i + 1) * batch_size, m1.shape[0]])
        if end <= start:
            break  # cause I'm too lazy to elegantly handle edge cases
        rows = m1[start: end]
        sim = cosine_similarity(rows, m2)  # rows is O(1) size
        ret[start: end] = sim
    return ret
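A minimal usage sketch against the matrix from the question (batch_size=1000 is just an assumed starting point to tune; note the dense 65000 x 65000 result still occupies roughly 34 GB of RAM):

import numpy as np

mat = np.random.rand(65000, 10)
# Compute all pairwise row similarities in 1000-row batches.
similarities = cosine_similarity_n_space(mat, mat, batch_size=1000)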
No crashes for me; YMMV. Try different batch sizes to make it faster. I used to only compare 1 row at a time, and it took about 30X as long on my machine.
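If even the finished 65000 x 65000 matrix is too large to keep in RAM, the same batching idea can stream each block to disk instead. Below is a sketch of that variant, not the original answer's code, assuming float32 precision is acceptable; the output file name sims.dat is arbitrary:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def cosine_similarity_to_disk(m1, m2, path='sims.dat', batch_size=1000):
    # Same batching as above, but the result lives in a disk-backed
    # memmap rather than RAM; float32 halves the on-disk footprint.
    assert m1.shape[1] == m2.shape[1]
    ret = np.memmap(path, dtype=np.float32, mode='w+',
                    shape=(m1.shape[0], m2.shape[0]))
    for start in range(0, m1.shape[0], batch_size):
        end = min(start + batch_size, m1.shape[0])
        ret[start:end] = cosine_similarity(m1[start:end], m2)
    ret.flush()  # ensure everything is written to disk
    return ret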
Stupid yet effective sanity check:
import random

while True:
    m = np.random.rand(random.randint(1, 100), random.randint(1, 100))
    n = np.random.rand(random.randint(1, 100), m.shape[1])
    # Loops until interrupted; any mismatch trips the assert.
    assert np.allclose(cosine_similarity(m, n), cosine_similarity_n_space(m, n))