Question
I'm trying to transform a DataFrame via a function that takes an array as a parameter. My code looks something like this:
def getCategory(categories: Array[String], input: String): String = {
  categories(input.toInt)
}

val myArray = Array("a", "b", "c")
val myCategories = udf(getCategory _)
val df = sqlContext.parquetFile("myfile.parquet")
val df1 = df.withColumn("newCategory", myCategories(lit(myArray), col("myInput")))
However, lit doesn't like arrays and this script errors out. I then tried defining a new partially applied function and creating the udf from that:
val newFunc = getCategory(myArray, _:String)
val myCategories = udf(newFunc)
val df1 = df.withColumn("newCategory", myCategories(col("myInput")))
This doesn't work either: I get a NullPointerException and it appears myArray is not being recognized. Any ideas on how to pass an array as a parameter to a function used with a DataFrame?
On a separate note, is there any explanation as to why something as simple as applying a function to a DataFrame is so complicated (define the function, redefine it as a UDF, and so on)?
Answer
Most likely not the prettiest solution, but you can try something like this:
import org.apache.spark.sql.functions.{col, udf}

def getCategory(categories: Array[String]) = {
  udf((input: String) => categories(input.toInt))
}

df.withColumn("newCategory", getCategory(myArray)(col("myInput")))
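For context, here is a minimal end-to-end sketch of this closure-based approach. The toy rows, the sample values, and the use of sqlContext.implicits are my own assumptions for illustration; only the myInput column name comes from the question:

import org.apache.spark.sql.functions.{col, udf}
import sqlContext.implicits._

// Hypothetical sample data: a single string column holding array indices.
val sample = Seq("0", "2", "1").map(Tuple1(_)).toDF("myInput")

val myArray = Array("a", "b", "c")

// The array is captured by the closure, so only the column is passed at call time.
def getCategory(categories: Array[String]) = {
  udf((input: String) => categories(input.toInt))
}

sample.withColumn("newCategory", getCategory(myArray)(col("myInput"))).show()
// Expected mapping: "0" -> "a", "2" -> "c", "1" -> "b"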
You could also try an array of literals:
import org.apache.spark.sql.functions.{array, lit, udf}
import sqlContext.implicits._  // for the $"..." column syntax

// Array columns are handed to Scala UDFs as Seq (WrappedArray), not Array.
val getCategory = udf(
  (input: String, categories: Seq[String]) => categories(input.toInt))

df.withColumn(
  "newCategory", getCategory($"myInput", array(myArray.map(lit(_)): _*)))
On a side note, using a Map instead of an Array is probably a better idea:
def mapCategory(categories: Map[String, String], default: String) = {
  udf((input: String) => categories.getOrElse(input, default))
}

val myMap = Map[String, String]("1" -> "a", "2" -> "b", "3" -> "c")
df.withColumn("newCategory", mapCategory(myMap, "foo")(col("myInput")))
Since Spark 1.5.0 you can also use the array function:
import org.apache.spark.sql.functions.{array, lit}

val colArray = array(myArray.map(lit _): _*)

// colArray is already a Column, so it does not need another lit() wrapper;
// the UDF it is passed to must accept the array argument as Seq[String].
myCategories(colArray, col("myInput"))
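Putting that together, a minimal sketch of the 1.5.0+ variant under the same toy-data assumptions as the earlier sketches; note the UDF declares its array parameter as Seq[String], since that is how array columns are handed to Scala UDFs:

import org.apache.spark.sql.functions.{array, col, lit, udf}
import sqlContext.implicits._

val myArray = Array("a", "b", "c")
// Hypothetical sample data, as in the earlier sketches.
val sample = Seq("0", "2", "1").map(Tuple1(_)).toDF("myInput")

// Declared with Seq[String]: the array column arrives as a WrappedArray, not an Array.
val myCategories = udf((categories: Seq[String], input: String) => categories(input.toInt))

val colArray = array(myArray.map(lit _): _*)
sample.withColumn("newCategory", myCategories(colArray, col("myInput"))).show()
// Expected mapping: "0" -> "a", "2" -> "c", "1" -> "b"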