Problem Description
Spark now offers predefined functions that can be used in DataFrames, and it seems they are highly optimized. My original question was going to be which is faster, but I did some testing myself and found the Spark functions to be about 10 times faster, at least in one instance. Does anyone know why this is so, and when a UDF would be faster (only for instances where an identical Spark function exists)?
Here is my testing code (run on Databricks Community Edition):
# UDF vs Spark function
from faker import Factory
from pyspark.sql.functions import lit, concat, udf
fake = Factory.create()
fake.seed(4321)
# Each entry consists of last_name, first_name, ssn, job, and age (at least 1)
from pyspark.sql import Row
def fake_entry():
    name = fake.name().split()
    return (name[1], name[0], fake.ssn(), fake.job(), abs(2016 - fake.date_time().year) + 1)

# Create a helper function to call a function repeatedly
def repeat(times, func, *args, **kwargs):
    for _ in xrange(times):
        yield func(*args, **kwargs)
data = list(repeat(500000, fake_entry))
print len(data)
data[0]
dataDF = sqlContext.createDataFrame(data, ('last_name', 'first_name', 'ssn', 'occupation', 'age'))
dataDF.cache()
UDF function:
concat_s = udf(lambda s: s + 's')
udfData = dataDF.select(concat_s(dataDF.first_name).alias('name'))
udfData.count()
Spark function:
spfData = dataDF.select(concat(dataDF.first_name, lit('s')).alias('name'))
spfData.count()
I ran both multiple times; the UDF usually took about 1.1 - 1.4 s, and the Spark concat function always took under 0.15 s.
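For reference, a minimal timing sketch along the same lines (reusing dataDF and concat_s from above; the helper name time_count is my own, and dataDF.count() is called once first so the cache is populated before either run is measured):

import time
from pyspark.sql.functions import udf, concat, lit

def time_count(df):
    # Trigger a full evaluation with count() and return the wall-clock time of the action
    start = time.time()
    df.count()
    return time.time() - start

dataDF.count()  # materialize the cache so both runs are measured on cached data
concat_s = udf(lambda s: s + 's')
print(time_count(dataDF.select(concat_s(dataDF.first_name).alias('name'))))          # Python UDF
print(time_count(dataDF.select(concat(dataDF.first_name, lit('s')).alias('name'))))  # built-in concat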
Recommended Answer
If you are asking about Python UDFs, the answer is probably never*. Since SQL functions are relatively simple and are not designed for complex tasks, it is pretty much impossible to compensate for the cost of repeated serialization, deserialization and data movement between the Python interpreter and the JVM.
The main reasons are already enumerated above and can be reduced to the simple fact that a Spark DataFrame is natively a JVM structure, and its standard access methods are implemented by simple calls to the Java API. UDFs, on the other hand, are implemented in Python and require moving data back and forth.
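One way to see this round trip (a rough sketch reusing dataDF and concat_s from the question; the exact node names vary by Spark version) is to compare the physical plans:

# The UDF version typically shows an extra Python-evaluation step
# (e.g. BatchEvalPython on Spark 2.x), i.e. rows are shipped to a Python worker and back
dataDF.select(concat_s(dataDF.first_name).alias('name')).explain()
# The built-in version stays inside the JVM as a plain projection over the scan
dataDF.select(concat(dataDF.first_name, lit('s')).alias('name')).explain()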
While PySpark in general requires data movement between the JVM and Python, in the case of the low-level RDD API it typically doesn't require expensive serde activity. Spark SQL adds the additional cost of serialization and deserialization, as well as the cost of moving data from and to the unsafe representation on the JVM. The latter applies to all UDFs (Python, Scala and Java), but the former is specific to non-native languages.
Unlike UDFs, Spark SQL functions operate directly on the JVM and are typically well integrated with both Catalyst and Tungsten. This means they can be optimized in the execution plan, and most of the time can benefit from codegen and other Tungsten optimizations. Moreover, they can operate on data in its "native" representation.
So in a sense the problem here is that a Python UDF has to bring the data to the code, while SQL expressions go the other way around.
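As a small illustration of sending the code to the data, the same transformation can also be written as a SQL expression string, which Catalyst compiles and runs entirely on the JVM (the variable name spfData2 is just for this example):

# Equivalent to the built-in-function version above; no Python worker is involved
spfData2 = dataDF.selectExpr("concat(first_name, 's') AS name")
spfData2.count()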
* According to rough estimates, a PySpark window UDF can beat a Scala window function.