This article looks at the question "Does VectorAssembler only output DenseVectors?" and how to handle it.

Problem Description

There is something very annoying about the behavior of VectorAssembler. I am currently transforming a set of columns into a single column of vectors and then using the StandardScaler function to apply scaling to the included features. However, it seems that Spark, for memory reasons, decides whether it should use a DenseVector or a SparseVector to represent each row of features. But when you need to use StandardScaler, SparseVector input is invalid; only DenseVectors are allowed. Does anybody know a solution to that?

I decided to just use a UDF instead, which turns the sparse vector into a dense vector. Kind of silly, but it works.
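For reference, a minimal sketch of such a UDF (the column name "features" and the surrounding DataFrame are assumptions, not part of the original question):

from pyspark.sql.functions import udf
from pyspark.ml.linalg import DenseVector, VectorUDT

# Hypothetical UDF: forces whatever vector type VectorAssembler produced
# into a DenseVector; toArray() works for both sparse and dense inputs.
to_dense = udf(lambda v: DenseVector(v.toArray()), VectorUDT())

# Assumed usage on a DataFrame with a "features" column:
# df = df.withColumn("features", to_dense("features"))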

Recommended Answer

You're right that VectorAssembler chooses between dense and sparse output format based on whichever uses less memory.
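As a quick illustration (a minimal sketch; the data and column names are made up), assembling one mostly-zero row and one fully populated row yields a mix of both vector types in the same column:

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(0.0, 0.0, 0.0, 7.0), (1.0, 2.0, 3.0, 4.0)],
    ["a", "b", "c", "d"],
)
assembler = VectorAssembler(inputCols=["a", "b", "c", "d"], outputCol="features")
# The mostly-zero row is stored sparse, e.g. (4,[3],[7.0]);
# the fully populated row is stored dense, e.g. [1.0,2.0,3.0,4.0].
assembler.transform(df).show(truncate=False)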

You don't need a UDF to convert a SparseVector to a DenseVector; just use the toArray() method:

from pyspark.ml.linalg import SparseVector, DenseVector

# A sparse vector of size 4 with non-zeros at indices 1 and 3.
a = SparseVector(4, [1, 3], [3.0, 4.0])
# toArray() materializes every component, giving a dense representation.
b = DenseVector(a.toArray())

Also, StandardScaler accepts SparseVector input unless you set withMean=True at creation. If you do need to de-mean, you have to subtract a (presumably non-zero) mean from every component, so the sparse vector won't be sparse any more.
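In other words, keeping withMean at its default of False lets the scaler consume sparse input directly. A minimal sketch (column names assumed):

from pyspark.ml.feature import StandardScaler

# withMean=False (the default) only divides by the standard deviation,
# which preserves zeros, so SparseVector input stays valid.
scaler = StandardScaler(
    inputCol="features", outputCol="scaled",
    withStd=True, withMean=False,
)
# Assumed usage: "assembled" is a DataFrame with a "features" vector column.
# model = scaler.fit(assembled)
# scaled = model.transform(assembled)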

