Problem description
I am seeing some very strange behaviour from VectorAssembler and I was wondering if anyone else has seen this.
My scenario is pretty straightforward. I parse data from a CSV file where I have some standard Int and Double fields, and I also calculate some extra columns. My parsing function returns this:
val joinedCounts = countPerChannel ++ countPerSource // two arrays of Doubles joined
(label, orderNo, pageNo, Vectors.dense(joinedCounts))
My main function uses the parsing function like this:
val parsedData = rawData.filter(row => row != header).map(parseLine)
val data = sqlContext.createDataFrame(parsedData).toDF("label", "orderNo", "pageNo","joinedCounts")
I then use a VectorAssembler like this:
val assembler = new VectorAssembler()
.setInputCols(Array("orderNo", "pageNo", "joinedCounts"))
.setOutputCol("features")
val assemblerData = assembler.transform(data)
So when I print a row of my data before it goes into the VectorAssembler, it looks like this:
[3.2,17.0,15.0,[0.0,0.0,0.0,0.0,3.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,2.0]]
After VectorAssembler's transform function, I print the same row of data and get this:
[3.2,(18,[0,1,6,9,14,17],[17.0,15.0,3.0,1.0,4.0,2.0])]
What on earth is going on? What has the VectorAssembler done? I've double-checked all the calculations and even followed the simple Spark examples, and I cannot see what is wrong with my code. Can you?
There is nothing strange about the output. Your vector has a lot of zero elements, so Spark used its sparse representation.
To explain further:
Your vector is composed of 18 elements (its dimension).
The indices [0,1,6,9,14,17] of the vector hold the non-zero elements, which are, in order, [17.0,15.0,3.0,1.0,4.0,2.0].
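To make the (size, indices, values) notation concrete, here is a minimal sketch in plain Scala, with no Spark dependency, that expands such a triple back into a dense array. The object name `SparseVectorDemo` and the helper `expand` are hypothetical names made up for this illustration; they are not part of Spark.

```scala
object SparseVectorDemo {
  // Hypothetical helper: expand a (size, indices, values) triple into a dense array.
  // Every position not listed in `indices` is zero.
  def expand(size: Int, indices: Array[Int], values: Array[Double]): Array[Double] = {
    require(indices.length == values.length, "indices and values must have the same length")
    val dense = Array.fill(size)(0.0)
    indices.zip(values).foreach { case (i, v) => dense(i) = v }
    dense
  }

  def main(args: Array[String]): Unit = {
    // The sparse vector from the question's output.
    val dense = expand(18, Array(0, 1, 6, 9, 14, 17), Array(17.0, 15.0, 3.0, 1.0, 4.0, 2.0))
    println(dense.mkString("[", ",", "]"))
    // prints [17.0,15.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,2.0]
  }
}
```

The first two entries are orderNo and pageNo, and the remaining 16 are the joinedCounts values shifted by two positions, so no data has been lost or changed.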
A sparse vector representation stores only the non-zero entries, which saves space and can make computation easier and faster. More on sparse representation here.
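As a rough illustration of the space saving, the sketch below compares the payload of the two layouts for the vector in the question. It assumes 8 bytes per Double value and 4 bytes per Int index and ignores object overhead; the names `StorageSketch`, `denseBytes` and `sparseBytes` are hypothetical, and Spark's actual rule for choosing a representation may differ from this simple comparison.

```scala
object StorageSketch {
  // Assumption: a dense vector stores one 8-byte Double per position.
  def denseBytes(size: Int): Int = 8 * size

  // Assumption: a sparse vector stores one 4-byte Int index and one 8-byte Double
  // per non-zero entry.
  def sparseBytes(nonZeros: Int): Int = (4 + 8) * nonZeros

  def main(args: Array[String]): Unit = {
    // The vector from the question: 18 positions, 6 of them non-zero.
    println(s"dense:  ${denseBytes(18)} bytes")  // 8 * 18 = 144
    println(s"sparse: ${sparseBytes(6)} bytes")  // 12 * 6 = 72
  }
}
```

Under these assumptions the sparse layout is smaller whenever fewer than two thirds of the entries are non-zero.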
Of course, you can convert the sparse representation to a dense one, but it comes at a cost.
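If you do want the dense form, a minimal sketch of the conversion looks like this. It assumes Spark 2.x or later, where the vector classes live in `org.apache.spark.ml.linalg`; on Spark 1.x the equivalent classes are in `org.apache.spark.mllib.linalg`. The object name `ToDenseDemo` is hypothetical.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}

object ToDenseDemo {
  // Rebuild the sparse vector that appears in the question's output.
  val sparse: Vector =
    Vectors.sparse(18, Array(0, 1, 6, 9, 14, 17), Array(17.0, 15.0, 3.0, 1.0, 4.0, 2.0))

  // toDense materialises every position, including the zeros.
  val dense: Vector = sparse.toDense

  def main(args: Array[String]): Unit = {
    println(sparse)
    println(dense)
  }
}
```

The cost is memory: the dense vector holds all 18 values rather than only the 6 non-zero ones, which adds up over many rows or much wider vectors.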
If you are interested in getting feature importances, I advise you to take a look at this.