Defining a Spark SQL Scala function for numeric types


Problem description

I have defined the following function to register as a Spark SQL UDF:

import scala.collection.mutable.WrappedArray

// Sums an array column whose elements are Long
def array_sum(x: WrappedArray[Long]): Long = {
    x.sum
}

I would like this function to work with any numeric type it receives as an argument. I tried the following:

import Numeric.Implicits._ 
import scala.reflect.ClassTag

def array_sum(x: WrappedArray[NumericType]) = {
   x.sum
}

But it does not work. Any ideas? Thank you!

Recommended answer

NumericType is specific to Spark SQL and is never exposed to UDFs, which receive standard Scala objects. So most likely you want something like this:

import scala.reflect.ClassTag
import org.apache.spark.sql.functions.udf
def array_sum[T : Numeric : ClassTag](x: Seq[T]) = x.sum
udf[Double, Seq[Double]](array_sum _)

That said, it doesn't look like there is much to gain here. To build something like this the right way, you should probably implement a custom expression.
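To make the point about standard Scala objects concrete, here is a minimal sketch in plain Scala with no Spark dependency. It shows that the generic function relies on Scala's own Numeric type class, and that the element type still has to be fixed once per concrete type, which is what each udf[...] call does. The names ArraySumSketch, arraySum, sumLong and sumDouble are made up for illustration and are not part of the original answer.

```scala
import scala.reflect.ClassTag

object ArraySumSketch {
  // Generic sum: the Numeric context bound supplies zero and addition for T.
  // ClassTag only mirrors the signature used in the answer; sum does not need it.
  def arraySum[T: Numeric: ClassTag](xs: Seq[T]): T = xs.sum

  // Hypothetical helpers: T is pinned to one concrete type per function value,
  // just as a UDF must be created once per input/output type.
  val sumLong: Seq[Long] => Long = xs => arraySum(xs)
  val sumDouble: Seq[Double] => Double = xs => arraySum(xs)

  def main(args: Array[String]): Unit = {
    println(sumLong(Seq(1L, 2L, 3L))) // prints 6
    println(sumDouble(Seq(1.0, 2.0))) // prints 3.0
  }
}
```

The practical consequence is that one generic definition can be reused, but a separate UDF still has to be created for each element type you expect in the column.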

Example:

val rddDouble: RDD[(Long, Array[Double])] = sc.parallelize(Seq((1L, Array(1.0, 2.0))))
val double_array_sum = udf[Double, Seq[Double]](array_sum _)
rddDouble.toDF("k", "v").select(double_array_sum($"v")).show

// +------+
// |UDF(v)|
// +------+
// |   3.0|
// +------+

val rddFloat: RDD[(Long, Array[Float])] = sc.parallelize(Seq(
  (1L, Array(1.0f, 2.0f))
))
val float_array_sum = udf[Float, Seq[Float]](array_sum _)
rddFloat.toDF("k", "v").select(float_array_sum($"v")).show

// +------+
// |UDF(v)|
// +------+
// |   3.0|
// +------+
