Problem description
Is it possible to create a UDF that returns a set of columns?
For example, given a DataFrame like this:
| Feature1 | Feature2 | Feature 3 |
| 1.3 | 3.4 | 4.5 |
Now I would like to extract a new feature that can be described as a vector of, say, two elements (e.g. the slope and offset from a linear regression). The desired dataset should look as follows:
| Feature1 | Feature2 | Feature 3 | Slope | Offset |
| 1.3 | 3.4 | 4.5 | 0.5 | 3 |
Is it possible to create multiple columns with a single UDF, or do I need to follow the rule "one column per UDF"?
Recommended answer
Struct method
You can define the udf function as
def myFunc: (String => (String, String)) = { s => (s.toLowerCase, s.toUpperCase)}
import org.apache.spark.sql.functions.udf
val myUDF = udf(myFunc)
and then use .* in the select:
val newDF = df.withColumn("newCol", myUDF(df("Feature2"))).select("Feature1", "Feature2", "Feature 3", "newCol.*")
I returned a Tuple2 from the udf function for testing purposes (higher-arity tuples can be used, depending on how many columns are required). Spark treats the returned tuple as a struct column. You can then use .* to select all of its elements as separate columns, and finally rename them.
You should get the following output:
+--------+--------+---------+---+---+
|Feature1|Feature2|Feature 3|_1 |_2 |
+--------+--------+---------+---+---+
|1.3     |3.4     |4.5      |3.4|3.4|
+--------+--------+---------+---+---+
You can then rename _1 and _2.
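As a minimal sketch of that renaming step, the following extends the struct example with withColumnRenamed. The object name StructUdfRename, the local SparkSession, the sample row, and the choice of Slope and Offset as target names are assumptions made for illustration; they are not part of the original answer.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

object StructUdfRename {
  // Same toy function as in the answer: it returns a pair,
  // which Spark stores as a struct with fields _1 and _2.
  val lowerUpper: String => (String, String) = s => (s.toLowerCase, s.toUpperCase)

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, used only to try the example out.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("struct-udf-rename")
      .getOrCreate()
    import spark.implicits._

    // Assumption: a one-row DataFrame with string columns, mirroring the question's table.
    val df = Seq(("1.3", "3.4", "4.5")).toDF("Feature1", "Feature2", "Feature 3")

    val myUDF = udf(lowerUpper)

    val newDF = df
      .withColumn("newCol", myUDF(col("Feature2")))
      .select("Feature1", "Feature2", "Feature 3", "newCol.*") // expands the struct into _1 and _2
      .withColumnRenamed("_1", "Slope")
      .withColumnRenamed("_2", "Offset")

    newDF.show(false)
    spark.stop()
  }
}
```

An alternative that avoids the rename is to select the struct fields directly, for example col("newCol._1").as("Slope").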
Array method
The udf function should return an array:
def myFunc: (String => Array[String]) = { s => Array(s.toLowerCase, s.toUpperCase) }
import org.apache.spark.sql.functions.udf
val myUDF = udf(myFunc)
You can then select the elements of the array and use alias to rename them:
// The $"..." column syntax requires import spark.implicits._ (assuming the SparkSession is named spark)
val newDF = df.withColumn("newCol", myUDF(df("Feature2"))).select($"Feature1", $"Feature2", $"Feature 3", $"newCol"(0).as("Slope"), $"newCol"(1).as("Offset"))
You should get:
+--------+--------+---------+-----+------+
|Feature1|Feature2|Feature 3|Slope|Offset|
+--------+--------+---------+-----+------+
|1.3     |3.4     |4.5      |3.4  |3.4   |
+--------+--------+---------+-----+------+
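To connect the array method back to the slope/offset example in the question, here is a rough sketch in which a single UDF takes the three feature columns and returns both values as an Array[Double]. The helper fitLine, the object name SlopeOffsetUdf, and the choice of x positions 0, 1, 2 for the three features are assumptions made for illustration; the question does not say how the slope and offset are actually computed.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

object SlopeOffsetUdf {
  // Hypothetical helper: least-squares line through the points (0, y0), (1, y1), ...
  // Returns Array(slope, offset). Needs at least two points, otherwise the slope is NaN.
  def fitLine(ys: Seq[Double]): Array[Double] = {
    val n = ys.length
    val xMean = (n - 1) / 2.0
    val yMean = ys.sum / n
    val sxy = ys.zipWithIndex.map { case (y, i) => (i - xMean) * (y - yMean) }.sum
    val sxx = ys.indices.map(i => (i - xMean) * (i - xMean)).sum
    val slope = sxy / sxx
    Array(slope, yMean - slope * xMean)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, used only to try the example out.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("slope-offset-udf")
      .getOrCreate()
    import spark.implicits._

    // Assumption: numeric feature columns, mirroring the question's table.
    val df = Seq((1.3, 3.4, 4.5)).toDF("Feature1", "Feature2", "Feature 3")

    // One UDF call produces both values; they are split into columns afterwards.
    val fitUDF = udf((a: Double, b: Double, c: Double) => fitLine(Seq(a, b, c)))

    val newDF = df
      .withColumn("fit", fitUDF(col("Feature1"), col("Feature2"), col("Feature 3")))
      .select(
        col("Feature1"), col("Feature2"), col("Feature 3"),
        col("fit")(0).as("Slope"),
        col("fit")(1).as("Offset"))

    newDF.show(false)
    spark.stop()
  }
}
```

An array works here because both results are doubles. If the returned values had different types, the struct method would be the more natural fit, since all elements of an array column share one type.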