Question
There is something very annoying about the behavior of VectorAssembler. I am currently transforming a set of columns into a single column of vectors and then using the StandardScaler function to apply scaling to the included features. However, it seems that Spark, for memory reasons, decides per row whether the features should be represented as a DenseVector or a SparseVector. But when you need to use StandardScaler, input of SparseVector(s) is invalid; only DenseVectors are allowed. Does anybody know a solution to that?
I decided to just use a UDF instead, which turns the sparse vector into a dense vector. Kind of silly, but it works.
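For reference, a minimal sketch of that kind of workaround (assuming PySpark's pyspark.ml vector types; the column name "features" is made up for illustration):

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.ml.linalg import DenseVector, SparseVector, VectorUDT

spark = SparkSession.builder.getOrCreate()

# UDF that forces every row into a DenseVector, whatever representation
# VectorAssembler happened to pick for it.
to_dense = udf(lambda v: DenseVector(v.toArray()), VectorUDT())

# Toy frame standing in for VectorAssembler output ("features" is a made-up name).
df = spark.createDataFrame(
    [(SparseVector(4, [1, 3], [3.0, 4.0]),),
     (DenseVector([1.0, 2.0, 3.0, 4.0]),)],
    ["features"])
df.withColumn("features_dense", to_dense("features")).show(truncate=False)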
Answer
You're right that VectorAssembler chooses the dense vs. sparse output format based on whichever one uses less memory.
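For example (a small illustrative sketch; the column names are made up), a mostly-zero row comes out sparse while a mostly-non-zero row comes out dense:

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(0.0, 0.0, 0.0, 5.0),    # mostly zeros -> sparse takes less memory
     (1.0, 2.0, 3.0, 4.0)],   # mostly non-zero -> dense takes less memory
    ["a", "b", "c", "d"])

assembler = VectorAssembler(inputCols=["a", "b", "c", "d"], outputCol="features")
assembler.transform(df).select("features").show(truncate=False)
# Typically prints something like:
#   (4,[3],[5.0])        <- SparseVector for the first row
#   [1.0,2.0,3.0,4.0]    <- DenseVector for the second row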
You don't need a UDF to convert from SparseVector to DenseVector, though; just use the toArray() method:
from pyspark.ml.linalg import SparseVector, DenseVector
a = SparseVector(4, [1, 3], [3.0, 4.0])   # length 4, non-zeros at indices 1 and 3
b = DenseVector(a.toArray())              # DenseVector([0.0, 3.0, 0.0, 4.0])
Also, StandardScaler accepts SparseVector input unless you set withMean=True at creation. If you do need to de-mean, you have to subtract a (presumably non-zero) number from every component, so the sparse vector won't be sparse any more.
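As a quick illustration (a sketch, assuming PySpark's pyspark.ml.feature.StandardScaler and a hypothetical "features" column produced by VectorAssembler):

from pyspark.ml.feature import StandardScaler

# withStd-only scaling (withMean=False is the default) works on sparse or dense rows.
scaler = StandardScaler(inputCol="features", outputCol="features_scaled",
                        withStd=True, withMean=False)
# model = scaler.fit(assembled_df)           # assembled_df: output of VectorAssembler
# scaled_df = model.transform(assembled_df)

# withMean=True centers every feature, which destroys sparsity, so it needs
# dense rows -- convert them first (e.g. with toArray()/DenseVector as above).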