Problem description
I am seeing some very strange behaviour from VectorAssembler, and I was wondering if anyone else has seen this.
My scenario is pretty straightforward. I parse data from a CSV file that has some standard Int and Double fields, and I also calculate some extra columns. My parsing function returns this:
val joinedCounts = countPerChannel ++ countPerSource // two arrays of Doubles joined
(label, orderNo, pageNo, Vectors.dense(joinedCounts))
My main function uses the parsing function like this:
val parsedData = rawData.filter(row => row != header).map(parseLine)
val data = sqlContext.createDataFrame(parsedData).toDF("label", "orderNo", "pageNo","joinedCounts")
Then I use VectorAssembler like this:
val assembler = new VectorAssembler()
  .setInputCols(Array("orderNo", "pageNo", "joinedCounts"))
  .setOutputCol("features")
val assemblerData = assembler.transform(data)
When I print a row of my data before it goes into the VectorAssembler, it looks like this:
[3.2,17.0,15.0,[0.0,0.0,0.0,0.0,3.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,2.0]]
After VectorAssembler's transform function, I print the same row of data and get this:
[3.2,(18,[0,1,6,9,14,17],[17.0,15.0,3.0,1.0,4.0,2.0])]
What on earth is going on? What has the VectorAssembler done? I've double-checked all the calculations and even followed the simple Spark examples, and I cannot see what is wrong with my code. Can you?
Recommended answer
There is nothing strange about the output. Your vector has many zero elements, so Spark used its sparse representation.
To explain further:
Your assembled vector has 18 elements (dimensions): orderNo, pageNo, and the 16 values of joinedCounts.
The indices [0,1,6,9,14,17] are the positions of the non-zero elements in that vector, and their values are, in order, [17.0,15.0,3.0,1.0,4.0,2.0]. Every other position is zero.
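To make the mapping concrete, here is a minimal plain-Scala sketch (no Spark dependency) that expands the sparse triple printed in the question and compares it with the input columns concatenated in order. The object name `SparseExpansion` and the helper `toDense` are made up for illustration; they are not part of Spark.

```scala
object SparseExpansion {
  // Hypothetical helper: expand a (size, indices, values) sparse triple
  // into a dense array, filling every unlisted position with 0.0.
  def toDense(size: Int, indices: Array[Int], values: Array[Double]): Array[Double] = {
    val dense = Array.fill(size)(0.0)
    indices.zip(values).foreach { case (i, v) => dense(i) = v }
    dense
  }

  // Values copied from the row printed in the question.
  val orderNo = 17.0
  val pageNo = 15.0
  val joinedCounts: Array[Double] = Array(
    0.0, 0.0, 0.0, 0.0, 3.0, 0.0, 0.0, 1.0,
    0.0, 0.0, 0.0, 0.0, 4.0, 0.0, 0.0, 2.0)

  // Conceptually, the assembler concatenates its input columns in order.
  val assembled: Array[Double] = Array(orderNo, pageNo) ++ joinedCounts

  // The sparse output from the question, expanded back to dense form.
  val expanded: Array[Double] =
    toDense(18, Array(0, 1, 6, 9, 14, 17), Array(17.0, 15.0, 3.0, 1.0, 4.0, 2.0))

  def main(args: Array[String]): Unit = {
    println(assembled.mkString("[", ",", "]"))
    println(expanded.mkString("[", ",", "]"))
    // Both arrays hold the same 18 values, so no data was lost or changed.
    println(assembled.sameElements(expanded))
  }
}
```

Each non-zero entry of joinedCounts is shifted by two positions in the assembled vector, because orderNo and pageNo occupy indices 0 and 1.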
A sparse vector representation stores only the non-zero entries, which saves memory and can make computation faster. More on sparse representation here.
Of course, you can convert the sparse representation to a dense one, but it comes at a cost.
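As a rough sketch of that conversion, assuming Spark 2.x or later, where the vector classes live in `org.apache.spark.ml.linalg` (the Spark 1.x code in the question would use `org.apache.spark.mllib.linalg` instead), and assuming the Spark MLlib jars are on the classpath. The object name `SparseToDense` is made up for illustration.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}

object SparseToDense {
  // The assembled feature vector from the question, in sparse form.
  val sparse: Vector =
    Vectors.sparse(18, Array(0, 1, 6, 9, 14, 17), Array(17.0, 15.0, 3.0, 1.0, 4.0, 2.0))

  // toDense materialises all 18 positions, zeros included.
  val dense: Vector = sparse.toDense

  def main(args: Array[String]): Unit = {
    println(sparse)             // sparse form: size, indices, values
    println(dense)              // dense form: every element written out
    println(sparse.numNonzeros) // number of non-zero entries
    println(dense.size)         // total number of elements
  }
}
```

The cost is memory: the dense form stores every element, while the sparse form stores only the non-zero values and their indices. For an 18-element vector the difference is negligible, but it grows with wide, mostly-zero feature vectors.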
If you are interested in getting feature importance, I advise you to take a look at this.