Problem Description
tl;dr How do I use pySpark to compare the similarity of rows?
I have a numpy array where I would like to compare the similarities of each row to one another.
print (pdArray)
#[[ 0. 1. 0. ..., 0. 0. 0.]
# [ 0. 0. 3. ..., 0. 0. 0.]
# [ 0. 0. 0. ..., 0. 0. 7.]
# ...,
# [ 5. 0. 0. ..., 0. 1. 0.]
# [ 0. 6. 0. ..., 0. 0. 3.]
# [ 0. 0. 0. ..., 2. 0. 0.]]
Using scikit-learn I can compute cosine similarities as follows...
pyspark.__version__
# '2.2.0'
from sklearn.metrics.pairwise import cosine_similarity
similarities = cosine_similarity(pdArray)
similarities.shape
# (475, 475)
similarities
array([[ 1.00000000e+00,  1.52204908e-03,  8.71545594e-02, ...,
         3.97681174e-04,  7.02593036e-04,  9.90472253e-04],
       [ 1.52204908e-03,  1.00000000e+00,  3.96760121e-04, ...,
         4.04724413e-03,  3.65324300e-03,  5.63519735e-04],
       [ 8.71545594e-02,  3.96760121e-04,  1.00000000e+00, ...,
         2.62367141e-04,  1.87878869e-03,  8.63876439e-06],
       ...,
       [ 3.97681174e-04,  4.04724413e-03,  2.62367141e-04, ...,
         1.00000000e+00,  8.05217639e-01,  2.69724702e-03],
       [ 7.02593036e-04,  3.65324300e-03,  1.87878869e-03, ...,
         8.05217639e-01,  1.00000000e+00,  3.00229809e-03],
       [ 9.90472253e-04,  5.63519735e-04,  8.63876439e-06, ...,
         2.69724702e-03,  3.00229809e-03,  1.00000000e+00]])
As I am looking to expand to much larger sets than my original (475 row) matrix, I am looking at using Spark via pySpark.
from pyspark.mllib.linalg.distributed import RowMatrix
# load data into Spark
tempSpark = sc.parallelize(pdArray)
mat = RowMatrix(tempSpark)
# Calculate exact similarities
exact = mat.columnSimilarities()
exact.entries.first()
# MatrixEntry(128, 211, 0.004969676943490767)
# Now when I get the data out I do the following...
# Convert to a RowMatrix.
rowMat = exact.toRowMatrix()
t_3 = rowMat.rows.collect()
a_3 = np.array([(x.toArray()) for x in t_3])
a_3.shape
# (488, 749)
As you can see, the shape of the data is a) no longer square (which it should be) and b) has dimensions which do not match the original number of rows... Now, it does match (in part) the number of features in each row (len(pdArray[0]) = 749), but I don't know where the 488 is coming from.
The presence of 749 makes me think I need to transpose my data first. Is that correct?
Finally, if this is the case, why are the dimensions not (749, 749)?
Recommended Answer
First, the columnSimilarities method only returns the off-diagonal entries of the upper triangular portion of the similarity matrix. With the absence of the 1's along the diagonal, you may have 0's for entire rows in the resulting similarity matrix.
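If you need the full symmetric matrix, it can be rebuilt from those upper-triangular entries. A minimal sketch, assuming exact is the CoordinateMatrix returned by columnSimilarities() above and sc is the active SparkContext:

from pyspark.mllib.linalg.distributed import CoordinateMatrix
n = exact.numCols()
# Mirror every upper-triangular entry across the diagonal and add the
# 1.0 self-similarity entries that columnSimilarities() omits.
fullEntries = (exact.entries
    .flatMap(lambda e: [(e.i, e.j, e.value), (e.j, e.i, e.value)])
    .union(sc.parallelize([(i, i, 1.0) for i in range(n)])))
fullSim = CoordinateMatrix(fullEntries, n, n)
print(fullSim.numRows(), fullSim.numCols())
# expected: square, e.g. (749, 749) for the untransposed input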
Second, a pyspark RowMatrix doesn't have meaningful row indices. So essentially, when converting from a CoordinateMatrix to a RowMatrix, the i value in the MatrixEntry is being mapped to whatever is convenient (probably some incrementing index). So what is likely happening is that the rows that are all 0's are simply being ignored and the matrix is being squished vertically when you convert it to a RowMatrix.
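You can check this squeezing directly. A rough sketch, reusing the names from the question's code (the counts in the comments are what the question's output suggests): the number of rows kept by toRowMatrix() should equal the number of distinct row indices that actually appear among the non-zero entries.

rowMat = exact.toRowMatrix()
print(rowMat.numRows())
# 488 -- only rows with at least one non-zero entry survive the conversion
print(exact.entries.map(lambda e: e.i).distinct().count())
# 488 -- the same number of distinct row indices in the CoordinateMatrix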
It probably makes sense to inspect the dimensions of the similarity matrix immediately after computation with the columnSimilarities method. You can do this by using the numRows() and numCols() methods.
print(exact.numRows(), exact.numCols())
Other than that, it does sound like you need to transpose your matrix to get the correct vector similarities. Furthermore, if there is some reason that you need this in a RowMatrix-like form, you could try using an IndexedRowMatrix, which does have meaningful row indices and would preserve the row indices from the original CoordinateMatrix upon conversion.
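Putting both points together, a minimal sketch, assuming pdArray and sc from the question are available: transpose the array before building the RowMatrix, so that columnSimilarities() compares the original 475 rows, and convert the result with toIndexedRowMatrix() to keep the row indices.

from pyspark.mllib.linalg.distributed import RowMatrix
# Transpose so the original rows become columns; columnSimilarities()
# then measures the similarity between the original rows.
matT = RowMatrix(sc.parallelize(pdArray.T))
rowSims = matT.columnSimilarities()
print(rowSims.numRows(), rowSims.numCols())
# expected: (475, 475)
# toIndexedRowMatrix() keeps the row index from each MatrixEntry, so
# all-zero rows no longer shift the rows below them.
indexed = rowSims.toIndexedRowMatrix()
indexed.rows.first()
# e.g. IndexedRow(0, [...])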