Spark functions vs UDF performance

Question

Spark now offers predefined functions that can be used in dataframes, and it seems they are highly optimized. My original question was going to be about which is faster, but I did some testing myself and found the Spark functions to be about 10 times faster, at least in one instance. Does anyone know why this is so, and when would a UDF be faster (only for cases where an identical Spark function exists)?

Here is my testing code (run on Databricks Community Edition):

# UDF vs Spark function
from faker import Faker
from pyspark.sql.functions import lit, concat, udf

# Faker's current API; older versions used Factory.create() and fake.seed()
fake = Faker()
Faker.seed(4321)

# Each entry consists of last_name, first_name, ssn, job, and age (at least 1).
# Note: name[1]/name[0] assumes a plain "First Last" name; a prefix or suffix
# from faker would make the indexing pick the wrong tokens.
def fake_entry():
    name = fake.name().split()
    return (name[1], name[0], fake.ssn(), fake.job(), abs(2016 - fake.date_time().year) + 1)

# Helper to call a function repeatedly
def repeat(times, func, *args, **kwargs):
    for _ in range(times):
        yield func(*args, **kwargs)

data = list(repeat(500000, fake_entry))
print(len(data))
data[0]

# sqlContext is predefined on Databricks
dataDF = sqlContext.createDataFrame(data, ('last_name', 'first_name', 'ssn', 'occupation', 'age'))
dataDF.cache()

UDF function:

concat_s = udf(lambda s: s + 's')
udfData = dataDF.select(concat_s(dataDF.first_name).alias('name'))
udfData.count()

Spark function:

spfData = dataDF.select(concat(dataDF.first_name, lit('s')).alias('name'))
spfData.count()

Ran both multiple times; the UDF usually took about 1.1-1.4 s, and the Spark concat function always took under 0.15 s.

Answer

If you ask about Python UDFs, the answer is probably never*. Since the SQL functions are relatively simple and are not designed for complex tasks, it is pretty much impossible to compensate for the cost of repeated serialization, deserialization and data movement between the Python interpreter and the JVM.
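To see that this overhead comes from the round trip itself rather than from the work the UDF does, here is a minimal sketch (reusing dataDF from the test above; the timings are illustrative only) comparing a no-op Python UDF against a plain column reference:

# Even an identity Python UDF pays the serialization round trip to the
# Python workers; selecting the column directly stays entirely in the JVM.
import time
from pyspark.sql.functions import col, udf

identity = udf(lambda s: s)  # does no useful work

start = time.time()
dataDF.select(identity(col('first_name')).alias('name')).count()
print('identity UDF: %.2f s' % (time.time() - start))

start = time.time()
dataDF.select(col('first_name').alias('name')).count()
print('plain column: %.2f s' % (time.time() - start))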

Does anyone know why this is so?

The main reasons are already enumerated above and can be reduced to the simple fact that a Spark DataFrame is natively a JVM structure and the standard access methods are implemented by simple calls to the Java API. A UDF, on the other hand, is implemented in Python and requires moving data back and forth.

While PySpark in general requires data movement between the JVM and Python, in the case of the low-level RDD API it typically doesn't require expensive serde activity. Spark SQL adds the additional cost of serialization and deserialization, as well as the cost of moving data from and to the unsafe representation on the JVM. The latter is specific to all UDFs (Python, Scala and Java), while the former is specific to non-native languages.
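For comparison, a rough sketch of the same "append s" operation at the RDD level (illustrative only, not a recommendation): the data still moves to the Python workers, but the extra SQL-specific step of converting rows out of and back into the unsafe representation is skipped.

# Same transformation via the low-level RDD API: rows still travel to the
# Python workers, but without the Spark SQL serde / unsafe-row conversion.
rddNames = dataDF.select('first_name').rdd.map(lambda row: row[0] + 's')
rddNames.take(3)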

Unlike UDFs, Spark SQL functions operate directly on the JVM and are typically well integrated with both Catalyst and Tungsten. This means they can be optimized in the execution plan, and most of the time they can benefit from codegen and other Tungsten optimizations. Moreover, they can operate on data in its "native" representation.
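This difference is visible in the physical plans. A sketch using explain() on the two queries from the test (the exact plan text varies by Spark version): the built-in concat appears as an ordinary projection eligible for whole-stage codegen, while the UDF version adds a BatchEvalPython step.

# Built-in function: stays in the JVM and can be whole-stage code generated.
dataDF.select(concat(dataDF.first_name, lit('s')).alias('name')).explain()
# expect a plain Project over the scan, e.g. *Project [concat(first_name, s) ...]

# Python UDF: the plan gains a BatchEvalPython node that ships rows to the
# Python workers and back.
dataDF.select(concat_s(dataDF.first_name).alias('name')).explain()
# expect something like ...BatchEvalPython [<lambda>(first_name ...)]...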

So in a sense the problem here is that a Python UDF has to bring the data to the code, while SQL expressions go the other way around.

* According to rough estimates, a PySpark window UDF can beat a Scala window function.
