This article describes how to compute the cosine similarity between two large numpy arrays in Python; it may serve as a useful reference if you are facing the same problem.

Problem description

I have two numpy arrays:

Array 1: 500,000 rows x 100 cols

Array 2: 160,000 rows x 100 cols

I would like to find the largest cosine similarity between each row in Array 1 and Array 2. In other words, I compute the cosine similarities between the first row in Array 1 and all the rows in Array 2, and find the maximum cosine similarity, and then I compute the cosine similarities between the second row in Array 1 and all the rows in Array 2, and find the maximum cosine similarity; and do this for the rest of Array 1.

I currently use sklearn's cosine_similarity() function and do the following, but it is extremely slow. I wonder if there is a faster way that doesn't involve multiprocessing/multithreading to accomplish what I want to do. Also, the arrays I have are not sparse.

from sklearn.metrics.pairwise import cosine_similarity as cosine
import numpy

results = []
for i in range(Array1.shape[0]):
    results.append(numpy.max(cosine(Array1[None, i, :], Array2)))

Recommended answer

Iterating in Python can be quite slow. It's always best to "vectorise" and use numpy operations on arrays as much as possible, which pass the work to numpy's low-level implementation, which is fast.

cosine_similarity is already vectorised. An ideal solution would therefore simply involve cosine_similarity(A, B) where A and B are your first and second arrays. Unfortunately this matrix is 500,000 by 160,000 which is too large to do in memory (it throws an error).
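
As a rough back-of-the-envelope sketch (assuming a float64 result, which is what you get for float64 inputs), the full 500,000 x 160,000 similarity matrix would need on the order of 640 GB:

import numpy as np

# Shapes taken from the question; the full pairwise result would be
# a 500,000 x 160,000 array of float64 values.
rows_a, rows_b = 500_000, 160_000
bytes_needed = rows_a * rows_b * np.dtype(np.float64).itemsize
print(bytes_needed / 1e9)  # roughly 640 GB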

The next best solution then is to split A (by rows) into large blocks (instead of individual rows) so that the result still fits in memory, and iterate over them. I find for your data that using 100 rows in each block fits in memory; much more and it doesn't work. Then we simply use .max and get our 100 maxes for each iteration, which we can collect together at the end.
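
For reference, a minimal sketch of that blocked approach using sklearn's cosine_similarity directly (the helper name blockwise_max_cosine and the default block size of 100 are just illustrative choices here):

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def blockwise_max_cosine(A, B, block=100):
    # Process A a block of rows at a time so that the intermediate
    # (block x B.shape[0]) similarity matrix fits in memory.
    maxes = []
    for start in range(0, A.shape[0], block):
        sims = cosine_similarity(A[start:start + block], B)
        maxes.append(sims.max(axis=1))
    return np.concatenate(maxes)

The normalisation trick described next makes this noticeably faster, because it avoids recomputing the row norms of B for every block.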

Doing it this way, though, suggests an additional time saving. The formula for the cosine similarity of two vectors is u.v / |u||v|, i.e. the cosine of the angle between them. Because we're iterating, we would keep recalculating the lengths of the rows of B each time and throwing the result away. A nice way around this is to use the fact that cosine similarity does not change if you scale the vectors (the angle is the same). So we can calculate all the row lengths only once and divide the rows by them to make them unit vectors. Then we calculate the cosine similarity simply as u.v, which can be done for arrays via matrix multiplication. I did a quick test of this and it was about 3 times faster.
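
To see the scale-invariance argument in a tiny example (a sketch, not part of the original answer): normalising two vectors first and then taking a plain dot product gives the same value as the full formula.

import numpy as np

u = np.random.random(100)
v = np.random.random(100)

# Full formula: u.v / (|u| |v|)
full = u.dot(v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Normalise first, then just take the dot product
unit = (u / np.linalg.norm(u)).dot(v / np.linalg.norm(v))

assert np.isclose(full, unit)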

Putting it all together:

import numpy as np

# Example data
A = np.random.random([500000, 100])
B = np.random.random([160000, 100])

# There may be a proper numpy method for this function, but it won't be much faster.
def normalise(A):
    # Divide each row by its Euclidean length so that every row is a unit vector.
    lengths = (A**2).sum(axis=1, keepdims=True)**.5
    return A/lengths

A = normalise(A)
B = normalise(B)

results = []

rows_in_slice = 100

slice_start = 0
slice_end = slice_start + rows_in_slice

# Loop until every row of A has been processed; numpy slicing clamps
# slice_end, so a final partial block is handled correctly as well.
while slice_start < A.shape[0]:

    results.append(A[slice_start:slice_end].dot(B.T).max(axis=1))

    slice_start += rows_in_slice
    slice_end = slice_start + rows_in_slice

result = np.concatenate(results)

This takes me about 2 seconds per 1,000 rows of A to run. So it should be about 1,000 seconds for your data.
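
As a quick sanity check (a small illustrative snippet, not from the original answer), the blocked matrix-multiplication result can be compared against sklearn's cosine_similarity on a small sample; cosine similarity is unaffected by the normalisation, so the values should match:

from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Compare the first 100 rows of the blocked result with sklearn's output.
expected = cosine_similarity(A[:100], B).max(axis=1)
assert np.allclose(result[:100], expected)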

This concludes the article on computing the cosine similarity between two large numpy arrays in Python; we hope the recommended answer above is helpful.
