

I'm trying to add a new column to a DataFrame. Its value should be taken from another column, and the name of that source column depends on the values of other columns in the same row.


For example, given this DataFrame:

+---+---+----+----+
|  A|  B| A_1| B_2|
+---+---+----+----+
|  A|  1| 0.1| 0.3|
|  B|  2| 0.2| 0.4|
+---+---+----+----+


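I would like to obtain this:

+---+---+----+----+----+
|  A|  B| A_1| B_2|   C|
+---+---+----+----+----+
|  A|  1| 0.1| 0.3| 0.1|
|  B|  2| 0.2| 0.4| 0.4|
+---+---+----+----+----+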

That is, I added column C, whose value comes from either column A_1 or column B_2. The name of the source column (A_1 in the first row) is formed by concatenating the values of columns A and B with an underscore.


I know that I can add a new column based on another and a constant like this:

df.withColumn("C", $"B" + 1)


I also know that the name of the column can come from a variable like this:

val name = "A_1"
df.withColumn("C", col(name) + 1)


However, what I'd like to do is something like this:

df.withColumn("C", col(s"${col("A")}_${col("B")}"))


NOTE: I'm coding in Scala 2.11 and Spark 2.2.


You can achieve this by writing a udf function. I suggest a udf because your requirement is to process the DataFrame row by row, whereas the built-in functions operate column by column.


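val columns = df.columns

import org.apache.spark.sql.functions._
import scala.collection.mutable

// Builds the name "<A>_<B>", finds its position among the DataFrame's columns,
// and returns the value at that position in the row's array of column values.
def getValue = udf((A: String, B: String, array: mutable.WrappedArray[String]) => array(columns.indexOf(A+"_"+B)))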


A is the value of the first column
B is the value of the second column
array is an array holding the values of all the columns in the row

Now just call the udf function using the withColumn API:

df.withColumn("C", getValue($"A", $"B", array(columns.map(col): _*))).show(false)


You should get your desired output dataframe.
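As an alternative to the udf, the same lookup can be sketched with built-in functions only, by comparing the name built from A and B against each candidate column name. This is not part of the answer above; it is a minimal sketch that assumes a local SparkSession and the sample data from the question, and the names candidates, target and picked are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{coalesce, col, concat_ws, when}

// Assumption: a local session for illustration; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("dynamic-column").getOrCreate()
import spark.implicits._

// Sample data from the question.
val df = Seq(("A", 1, 0.1, 0.3), ("B", 2, 0.2, 0.4)).toDF("A", "B", "A_1", "B_2")

// Candidate source columns: every column except the two key columns (hypothetical choice).
val candidates = df.columns.filterNot(Set("A", "B"))

// Per-row name of the column to read from, e.g. "A_1".
val target = concat_ws("_", $"A", $"B".cast("string"))

// One `when` per candidate: it yields the candidate's value where the name matches
// and null otherwise, so `coalesce` keeps the single matching value.
val picked = coalesce(candidates.map(c => when(target === c, col(c))): _*)

val result = df.withColumn("C", picked)
result.show(false)
```

With this approach C keeps the type of the source columns (double here), and a row whose A and B values do not correspond to any candidate column gets null in C instead of failing.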
