Problem description
I'm trying to add a new column to a DataFrame. The value of this column is the value of another column whose name depends on other columns from the same DataFrame.
For example, given this:
+---+---+----+----+
| A| B| A_1| B_2|
+---+---+----+----+
| A| 1| 0.1| 0.3|
| B| 2| 0.2| 0.4|
+---+---+----+----+
I'd like to get this:
+---+---+----+----+----+
| A| B| A_1| B_2| C|
+---+---+----+----+----+
| A| 1| 0.1| 0.3| 0.1|
| B| 2| 0.2| 0.4| 0.4|
+---+---+----+----+----+
That is, I added column C, whose value comes from either column A_1 or B_2. The name of the source column (A_1 for the first row) comes from concatenating the values of columns A and B.
I know that I can add a new column based on another column and a constant like this:
df.withColumn("C", $"B" + 1)
I also know that the name of the column can come from a variable like this:
val name = "A_1"
df.withColumn("C", col(name) + 1)
However, what I'd like to do is something like this:
df.withColumn("C", col(s"${col("A")}_${col("B")}"))
This doesn't work.
NOTE: I'm coding in Scala 2.11 and Spark 2.2.
Recommended answer
You can achieve this by writing a udf function. I suggest a udf because your requirement is to process the dataframe row by row, whereas the built-in functions operate column by column.
But before that, you need the array of column names:
val columns = df.columns
Then write a udf function as
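import org.apache.spark.sql.functions._
import scala.collection.mutable
// Find the position of the column named "<A>_<B>" and return that entry of the row's values.
def getValue = udf((A: String, B: String, array: mutable.WrappedArray[String]) => array(columns.indexOf(A + "_" + B)))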
where
A is the first column value
B is the second column value
array is the array of all the column values in the row
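The lookup that the udf performs can be tried without Spark. The sketch below is plain Scala. The helper name `lookup` and the sample row values are hypothetical, chosen only for illustration. It returns an `Option` so that a name with no matching column gives `None` rather than an exception from indexing at -1, which is a guard the udf above does not have.

```scala
// Column names in the same order as the example DataFrame.
val columns = Array("A", "B", "A_1", "B_2")

// Hypothetical helper: given the row's A and B values and all of the row's
// values (as strings), return the value of the column named "<A>_<B>".
def lookup(a: String, b: String, values: Seq[String]): Option[String] = {
  val idx = columns.indexOf(a + "_" + b) // -1 when no such column exists
  if (idx >= 0) Some(values(idx)) else None
}

println(lookup("A", "1", Seq("A", "1", "0.1", "0.3"))) // Some(0.1)
println(lookup("B", "2", Seq("B", "2", "0.2", "0.4"))) // Some(0.4)
println(lookup("C", "3", Seq("C", "3", "0.5", "0.6"))) // None
```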
Now just call the udf function using the withColumn API:
df.withColumn("C", getValue($"A", $"B", array(columns.map(col): _*))).show(false)
You should get your desired output dataframe.
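For comparison, the same result can probably be reached with built-in functions only. This is a different technique from the udf above, not part of the original answer. The sketch builds one `when` expression per candidate column and lets `coalesce` pick the branch that matched. It assumes that every column other than A and B is a candidate source and that all candidates share one type. The names `key`, `candidates`, `picked`, and `result` are illustrative, and the local `SparkSession` setup is only there to make the example self-contained.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{coalesce, col, concat_ws, when}

// Local session for the example; in spark-shell the existing session is reused.
val spark = SparkSession.builder().master("local[*]").appName("dynamic-column").getOrCreate()
import spark.implicits._

val df = Seq(("A", 1, 0.1, 0.3), ("B", 2, 0.2, 0.4)).toDF("A", "B", "A_1", "B_2")

// Per-row name of the source column, e.g. "A_1" for the first row.
val key = concat_ws("_", col("A"), col("B").cast("string"))

// Assumption: every column except A and B is a candidate source column.
val candidates = df.columns.filterNot(Set("A", "B"))

// One when(...) per candidate: that column's value if its name matches, else null.
val picked = candidates.map(name => when(key === name, col(name)))

// coalesce returns the first non-null branch, i.e. the matching column's value.
val result = df.withColumn("C", coalesce(picked: _*))
result.show(false)
```

With this approach C keeps the type of the source columns (double here), and a row whose key matches no column gets null in C.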