Problem Description
Is there a built in way to create a sparse vector from a dense vector in PySpark? The way I am doing this is the following:
Vectors.sparse(len(denseVector), [(i, j) for i, j in enumerate(denseVector) if j != 0])
That satisfies the [size, (index, data)] format. Seems kinda hacky. Is there a more efficient way to do it?
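For reference, a minimal runnable version of that approach (the sample dense vector below is assumed for illustration):
from pyspark.ml.linalg import Vectors

denseVector = Vectors.dense([0.0, 1.5, 0.0, 3.0])  # assumed sample input
# Keep only the non-zero entries as (index, value) pairs
sparseVector = Vectors.sparse(len(denseVector), [(i, v) for i, v in enumerate(denseVector) if v != 0])
print(sparseVector)  # SparseVector(4, {1: 1.5, 3: 3.0})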
Recommended Answer
import scipy.sparse
from pyspark.ml.linalg import Vectors, _convert_to_vector, VectorUDT
from pyspark.sql.functions import udf, col
If you have just one dense vector this will do it:
def dense_to_sparse(vector):
return _convert_to_vector(scipy.sparse.csc_matrix(vector.toArray()).T)
dense_to_sparse(densevector)
The trick here is that csc_matrix.shape[1] has to equal 1, so transpose the vector. Have a look at the source of _convert_to_vector: https://people.eecs.berkeley.edu/~jegonzal/pyspark/_modules/pyspark/mllib/linalg.html
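For example, applying it to a single DenseVector (the sample vector here is assumed for illustration) should yield a SparseVector holding only the non-zero entries:
densevector = Vectors.dense([1.0, 0.0, 0.0, 2.0])  # assumed sample input
print(dense_to_sparse(densevector))  # SparseVector(4, {0: 1.0, 3: 2.0})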
The more likely scenario is you have a DF with a column of densevectors:
to_sparse = udf(dense_to_sparse, VectorUDT())
DF.withColumn("sparse", to_sparse(col("densevector")))
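To make that concrete, here is a minimal sketch; the SparkSession named spark and the toy DataFrame are assumptions for illustration:
# Toy DataFrame with a dense-vector column (assumed example data)
DF = spark.createDataFrame(
    [(Vectors.dense([0.0, 1.0, 0.0]),), (Vectors.dense([2.0, 0.0, 3.0]),)],
    ["densevector"],
)
DF.withColumn("sparse", to_sparse(col("densevector"))).show(truncate=False)
# The "sparse" column should show (3,[1],[1.0]) and (3,[0,2],[2.0,3.0])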