问题描述
我有一个Spark DataFrame,其中有一个包含Vector值的列.向量值都是n维的,又具有相同的长度.我也有一个列名Array("f1", "f2", "f3", ..., "fn")
的列表,每个列名对应于向量中的一个元素.
I have a Spark DataFrame where I have a column with Vector values. The vector values are all n-dimensional, aka with the same length. I also have a list of column names Array("f1", "f2", "f3", ..., "fn")
, each corresponds to one element in the vector.
some_columns... | Features
... | [0,1,0,..., 0]
to
some_columns... | f1 | f2 | f3 | ... | fn
... | 0 | 1 | 0 | ... | 0
实现此目标的最佳方法是什么?我想到了一种方法,该方法是使用createDataFrame(Row(Features), featureNameList)
创建一个新的DataFrame,然后与旧的联接,但是它需要spark上下文才能使用createDataFrame.我只想转换现有的数据框.我也知道.withColumn("fi", value)
,但是如果n
很大怎么办?
What is the best way to achieve this? I thought of one way which is to create a new DataFrame with createDataFrame(Row(Features), featureNameList)
and then join with the old one, but it requires spark context to use createDataFrame. I only want to transform the existing data frame. I also know .withColumn("fi", value)
but what do I do if n
is large?
我是Scala和Spark的新手,因此找不到任何很好的例子.我认为这可能是一项常见的任务.我的特殊情况是我使用CountVectorizer
并希望单独恢复每一列以提高可读性,而不是仅获得向量结果.
I'm new to Scala and Spark and couldn't find any good examples for this. I think this can be a common task. My particular case is that I used the CountVectorizer
and wanted to recover each column individually for better readability instead of only having the vector result.
推荐答案
一种方法是将vector
列转换为array<double>
,然后使用getItem
提取单个元素.
One way could be to convert the vector
column to an array<double>
and then using getItem
to extract individual elements.
import org.apache.spark.sql.functions._
import org.apache.spark.ml._
val df = Seq( (1 , linalg.Vectors.dense(1,0,1,1,0) ) ).toDF("id", "features")
//df: org.apache.spark.sql.DataFrame = [id: int, features: vector]
df.show
//+---+---------------------+
//|id |features |
//+---+---------------------+
//|1 |[1.0,0.0,1.0,1.0,0.0]|
//+---+---------------------+
// A UDF to convert VectorUDT to ArrayType
val vecToArray = udf( (xs: linalg.Vector) => xs.toArray )
// Add a ArrayType Column
val dfArr = df.withColumn("featuresArr" , vecToArray($"features") )
// Array of element names that need to be fetched
// ArrayIndexOutOfBounds is not checked.
// sizeof `elements` should be equal to the number of entries in column `features`
val elements = Array("f1", "f2", "f3", "f4", "f5")
// Create a SQL-like expression using the array
val sqlExpr = elements.zipWithIndex.map{ case (alias, idx) => col("featuresArr").getItem(idx).as(alias) }
// Extract Elements from dfArr
dfArr.select(sqlExpr : _*).show
//+---+---+---+---+---+
//| f1| f2| f3| f4| f5|
//+---+---+---+---+---+
//|1.0|0.0|1.0|1.0|0.0|
//+---+---+---+---+---+
这篇关于Scala Spark-将向量列拆分为Spark DataFrame中的单独列的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!