How to add a new column to a Spark DataFrame using a Java UDF


Problem description

I have a Dataset<Row> inputDS with four columns: Id, List<Long> time, List<String> value, and aggregateType. I want to add one more column, value_new, using a map function. The map function should take the time, value, and aggregateType columns, pass them to getAggregate(String aggregateType, List<Long> time, List<String> value), and return a double. The Double returned by getAggregate becomes the value of the new column value_new.
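The question does not show the body of getAggregate. To make the expected result concrete, here is a minimal sketch of what such a function might look like. The class name AggregateSketch and the function body are assumptions for illustration. Only the "Sum" type from the sample row is handled, and time is accepted but unused.

```java
import java.util.List;

final class AggregateSketch {

    // Hypothetical stand-in for the asker's getAggregate; the real logic is not shown in the question.
    static Double getAggregate(String aggregateType, List<Long> time, List<String> value) {
        if (!"Sum".equalsIgnoreCase(aggregateType)) {
            throw new IllegalArgumentException("Unsupported aggregateType: " + aggregateType);
        }
        // The values arrive as strings, so parse each one before adding it up.
        // The time list is accepted to match the signature but is not used by this sketch.
        double sum = 0.0;
        for (String v : value) {
            sum += Double.parseDouble(v);
        }
        return sum;
    }
}
```

Called with "Sum" and the values "1.5", "3.4", "4.5", this returns approximately 9.4, which matches the value_new shown in the expected output.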

Dataset inputDS

 +------+---------------+---------------------------------------------+---------------+
 |    Id| value         |     time                                    |aggregateType  |
 +------+---------------+---------------------------------------------+---------------+
 |0001  |  [1.5,3.4,4.5]| [1551502200000,1551502200000,1551502200000] | Sum           |
 +------+---------------+---------------------------------------------+---------------+

Expected Dataset outputDS

 +------+---------------+---------------------------------------------+---------------+-----------+
 |    Id| value         |     time                                    |aggregateType  | value_new |
 +------+---------------+---------------------------------------------+---------------+-----------+
 |0001  |  [1.5,3.4,4.5]| [1551502200000,1551502200000,1551502200000] | Sum           |   9.4     |
 +------+---------------+---------------------------------------------+---------------+-----------+

The code I tried:

 // Attempt: wrap the Dataset<Double> produced by map() in lit() (this raises the error below)
 inputDS.withColumn("value_new", functions.lit(inputDS.map(new MapFunction<Row, Double>() {
   @Override
   public Double call(Row row) {
     String aggregateType = row.getAs("aggregateType");
     List<Long> timeList = row.getList(row.fieldIndex("time"));
     List<String> valueList = row.getList(row.fieldIndex("value"));
     return getAggregate(aggregateType, timeList, valueList);
   }
 }, Encoders.DOUBLE())));

Error

 Unsupported literal type class org.apache.spark.sql.Dataset [value:double]

Note: sorry if I used the map function incorrectly; please suggest a workaround if there is one.

Thanks!


Recommended answer
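The error arises because lit expects a literal value, while map returns a whole Dataset. One common way to compute a per-row column is to register the function as a UDF and call it from withColumn. The following is a minimal sketch under stated assumptions, not a definitive implementation. It assumes time is an array of longs and value is an array of strings, as in the question's signature. The class name AddValueNew, the helper withValueNew, and the body of getAggregate are hypothetical.

```java
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.api.java.UDF3;
import org.apache.spark.sql.types.DataTypes;
import scala.collection.JavaConverters;
import scala.collection.Seq;

import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.col;

public class AddValueNew {

    // Hypothetical placeholder: replace with the real getAggregate logic.
    static Double getAggregate(String aggregateType, List<Long> time, List<String> value) {
        if (!"Sum".equalsIgnoreCase(aggregateType)) {
            throw new IllegalArgumentException("Unsupported aggregateType: " + aggregateType);
        }
        return value.stream().mapToDouble(Double::parseDouble).sum();
    }

    // Hypothetical helper: returns inputDS with the extra column value_new.
    static Dataset<Row> withValueNew(SparkSession spark, Dataset<Row> inputDS) {
        // Array columns reach a Java UDF as Scala sequences,
        // so convert them to java.util.List before calling getAggregate.
        spark.udf().register(
            "getAggregate",
            (UDF3<String, Seq<Long>, Seq<String>, Double>) (aggregateType, time, value) ->
                getAggregate(
                    aggregateType,
                    JavaConverters.seqAsJavaListConverter(time).asJava(),
                    JavaConverters.seqAsJavaListConverter(value).asJava()),
            DataTypes.DoubleType);

        // The UDF is evaluated once per row, so the result is an ordinary column.
        return inputDS.withColumn(
            "value_new",
            callUDF("getAggregate", col("aggregateType"), col("time"), col("value")));
    }
}
```

With this in place, a call such as AddValueNew.withValueNew(spark, inputDS) would yield the four original columns plus value_new. If the value column is actually stored as an array of doubles, the Seq<String> type parameter and the parsing step would need to change accordingly. An equivalent way to invoke the registered function is functions.expr("getAggregate(aggregateType, time, value)").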