Problem Description
tl;dr How do I use pySpark to compare the similarity of rows?
I have a numpy array where I would like to compare the similarities of each row to one another.
print (pdArray)
#[[ 0. 1. 0. ..., 0. 0. 0.]
# [ 0. 0. 3. ..., 0. 0. 0.]
# [ 0. 0. 0. ..., 0. 0. 7.]
# ...,
# [ 5. 0. 0. ..., 0. 1. 0.]
# [ 0. 6. 0. ..., 0. 0. 3.]
# [ 0. 0. 0. ..., 2. 0. 0.]]
Using scikit-learn I can compute cosine similarities as follows...
pyspark.__version__
# '2.2.0'
from sklearn.metrics.pairwise import cosine_similarity
similarities = cosine_similarity(pdArray)
similarities.shape
# (475, 475)
similarities
array([[ 1.00000000e+00,  1.52204908e-03,  8.71545594e-02, ...,
         3.97681174e-04,  7.02593036e-04,  9.90472253e-04],
       [ 1.52204908e-03,  1.00000000e+00,  3.96760121e-04, ...,
         4.04724413e-03,  3.65324300e-03,  5.63519735e-04],
       [ 8.71545594e-02,  3.96760121e-04,  1.00000000e+00, ...,
         2.62367141e-04,  1.87878869e-03,  8.63876439e-06],
       ...,
       [ 3.97681174e-04,  4.04724413e-03,  2.62367141e-04, ...,
         1.00000000e+00,  8.05217639e-01,  2.69724702e-03],
       [ 7.02593036e-04,  3.65324300e-03,  1.87878869e-03, ...,
         8.05217639e-01,  1.00000000e+00,  3.00229809e-03],
       [ 9.90472253e-04,  5.63519735e-04,  8.63876439e-06, ...,
         2.69724702e-03,  3.00229809e-03,  1.00000000e+00]])
As I am looking to expand to much larger sets than my original (475 row) matrix, I am looking at using Spark via pySpark.
from pyspark.mllib.linalg.distributed import RowMatrix
# load data into Spark
tempSpark = sc.parallelize(pdArray)
mat = RowMatrix(tempSpark)
# Calculate exact similarities
exact = mat.columnSimilarities()
exact.entries.first()
# MatrixEntry(128, 211, 0.004969676943490767)
# Now when I get the data out I do the following...
# Convert to a RowMatrix.
rowMat = exact.toRowMatrix()
t_3 = rowMat.rows.collect()
a_3 = np.array([(x.toArray()) for x in t_3])
a_3.shape
# (488, 749)
As you can see, the shape of the data is a) no longer square (which it should be) and b) has dimensions which do not match the original number of rows... Now, it does match (in part) the number of features in each row (len(pdArray[0]) = 749), but I don't know where the 488 is coming from.
The presence of 749 makes me think I need to transpose my data first. Is that correct?
Finally, if this is the case, why are the dimensions not (749, 749)?
Recommended Answer
First, the columnSimilarities method only returns the off-diagonal entries of the upper triangular portion of the similarity matrix. With the absence of the 1's along the diagonal, you may have 0's for entire rows in the resulting similarity matrix.
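If you need the full symmetric matrix, it can be rebuilt from those upper-triangular entries. A minimal sketch, assuming exact is the CoordinateMatrix returned by columnSimilarities() above and sc is the active SparkContext:

from pyspark.mllib.linalg.distributed import CoordinateMatrix
n = exact.numCols()
# Mirror every upper-triangular entry across the diagonal and add the
# 1.0 self-similarity entries that columnSimilarities() omits.
fullEntries = (exact.entries
    .flatMap(lambda e: [(e.i, e.j, e.value), (e.j, e.i, e.value)])
    .union(sc.parallelize([(i, i, 1.0) for i in range(n)])))
fullSim = CoordinateMatrix(fullEntries, n, n)
print(fullSim.numRows(), fullSim.numCols())
# expected: square, e.g. (749, 749) for the untransposed input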
Second, a pyspark RowMatrix doesn't have meaningful row indices. So essentially, when converting from a CoordinateMatrix to a RowMatrix, the i value in the MatrixEntry is being mapped to whatever is convenient (probably some incrementing index). So what is likely happening is that the rows that are all 0's are simply being ignored and the matrix is being squished vertically when you convert it to a RowMatrix.
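You can check this squeezing directly. A rough sketch, reusing the names from the question's code (the counts in the comments are what the question's output suggests): the number of rows kept by toRowMatrix() should equal the number of distinct row indices that actually appear among the non-zero entries.

rowMat = exact.toRowMatrix()
print(rowMat.numRows())
# 488 -- only rows with at least one non-zero entry survive the conversion
print(exact.entries.map(lambda e: e.i).distinct().count())
# 488 -- the same number of distinct row indices in the CoordinateMatrix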
It probably makes sense to inspect the dimensions of the similarity matrix immediately after computation with the columnSimilarities method. You can do this by using the numRows() and numCols() methods.
print(exact.numRows(), exact.numCols())
Other than that, it does sound like you need to transpose your matrix to get the correct vector similarities. Furthermore, if there is some reason that you need this in a RowMatrix-like form, you could try using an IndexedRowMatrix, which does have meaningful row indices and would preserve the row indices from the original CoordinateMatrix upon conversion.
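Putting both points together, a minimal sketch, assuming pdArray and sc from the question are available: transpose the array before building the RowMatrix, so that columnSimilarities() compares the original 475 rows, and convert the result with toIndexedRowMatrix() to keep the row indices.

from pyspark.mllib.linalg.distributed import RowMatrix
# Transpose so the original rows become columns; columnSimilarities()
# then measures the similarity between the original rows.
matT = RowMatrix(sc.parallelize(pdArray.T))
rowSims = matT.columnSimilarities()
print(rowSims.numRows(), rowSims.numCols())
# expected: (475, 475)
# toIndexedRowMatrix() keeps the row index from each MatrixEntry, so
# all-zero rows no longer shift the rows below them.
indexed = rowSims.toIndexedRowMatrix()
indexed.rows.first()
# e.g. IndexedRow(0, [...])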