Question
There is something very annoying about the behavior of VectorAssembler. I am currently transforming a set of columns into a single column of vectors and then using the StandardScaler function to apply scaling to the included features. However, it seems that Spark, for memory reasons, decides per row whether the features should be represented as a DenseVector or a SparseVector. But when you need to use StandardScaler, input of SparseVector(s) is invalid; only DenseVectors are allowed. Does anybody know a solution to that?
I decided to just use a UDF instead, which turns the sparse vector into a dense vector. Kind of silly, but it works.
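For reference, a minimal sketch of that kind of workaround (assuming PySpark's pyspark.ml vector types; the column name "features" is made up for illustration):

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.ml.linalg import DenseVector, SparseVector, VectorUDT

spark = SparkSession.builder.getOrCreate()

# UDF that forces every row into a DenseVector, whatever representation
# VectorAssembler happened to pick for it.
to_dense = udf(lambda v: DenseVector(v.toArray()), VectorUDT())

# Toy frame standing in for VectorAssembler output ("features" is a made-up name).
df = spark.createDataFrame(
    [(SparseVector(4, [1, 3], [3.0, 4.0]),),
     (DenseVector([1.0, 2.0, 3.0, 4.0]),)],
    ["features"])
df.withColumn("features_dense", to_dense("features")).show(truncate=False)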
Answer
You're right that VectorAssembler chooses the dense vs. sparse output format based on whichever one uses less memory.
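For example (a small illustrative sketch; the column names are made up), a mostly-zero row comes out sparse while a mostly-non-zero row comes out dense:

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(0.0, 0.0, 0.0, 5.0),    # mostly zeros -> sparse takes less memory
     (1.0, 2.0, 3.0, 4.0)],   # mostly non-zero -> dense takes less memory
    ["a", "b", "c", "d"])

assembler = VectorAssembler(inputCols=["a", "b", "c", "d"], outputCol="features")
assembler.transform(df).select("features").show(truncate=False)
# Typically prints something like:
#   (4,[3],[5.0])        <- SparseVector for the first row
#   [1.0,2.0,3.0,4.0]    <- DenseVector for the second row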
You don't need a UDF to convert from SparseVector to DenseVector, though; just use the toArray() method:
from pyspark.ml.linalg import SparseVector, DenseVector
a = SparseVector(4, [1, 3], [3.0, 4.0])   # length 4, non-zeros at indices 1 and 3
b = DenseVector(a.toArray())              # DenseVector([0.0, 3.0, 0.0, 4.0])
Also, StandardScaler accepts SparseVector input unless you set withMean=True at creation. If you do need to de-mean, you have to subtract a (presumably non-zero) number from every component, so the sparse vector won't be sparse any more.
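As a quick illustration (a sketch, assuming PySpark's pyspark.ml.feature.StandardScaler and a hypothetical "features" column produced by VectorAssembler):

from pyspark.ml.feature import StandardScaler

# withStd-only scaling (withMean=False is the default) works on sparse or dense rows.
scaler = StandardScaler(inputCol="features", outputCol="features_scaled",
                        withStd=True, withMean=False)
# model = scaler.fit(assembled_df)           # assembled_df: output of VectorAssembler
# scaled_df = model.transform(assembled_df)

# withMean=True centers every feature, which destroys sparsity, so it needs
# dense rows -- convert them first (e.g. with toArray()/DenseVector as above).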