Problem description
I am trying to write some performance-mindful code in Spark and am wondering whether I should write an Aggregator or a User-defined Aggregate Function (UDAF) for my rollup operations on a DataFrame.
I have not been able to find any data on how fast each of these approaches is, or on which one you should use for Spark 2.0+.
Recommended answer
You should write an Aggregator rather than a UserDefinedAggregateFunction, because UserDefinedAggregateFunction performs inefficient serialization/deserialization for each row. Rewriting a UserDefinedAggregateFunction as an Aggregator can improve performance by anywhere from 25%-30% up to 100x, as stated in the pull request replacing UserDefinedAggregateFunction with Aggregator.
Because of these performance issues, the UserDefinedAggregateFunction class has been deprecated in Spark 3.0.
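As a rough illustration of the recommended approach, here is a minimal sketch of a typed Aggregator that computes a mean and is applied to a DataFrame rollup. The names AvgBuffer, MeanAggregator, and Demo, along with the sample data and column names, are hypothetical and not taken from the question. The sketch assumes Spark 3.0+, where functions.udaf can wrap an Aggregator for use on untyped DataFrame columns; on Spark 2.x an Aggregator would instead typically be applied to a typed Dataset via toColumn.

```scala
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.functions.{col, udaf}

// Hypothetical intermediate buffer: running sum and row count.
case class AvgBuffer(sum: Double, count: Long)

// Hypothetical example aggregator: mean of a Double column.
object MeanAggregator extends Aggregator[Double, AvgBuffer, Double] {
  // Empty buffer used as the starting point on each partition.
  def zero: AvgBuffer = AvgBuffer(0.0, 0L)

  // Fold one input value into the buffer.
  def reduce(b: AvgBuffer, x: Double): AvgBuffer =
    AvgBuffer(b.sum + x, b.count + 1)

  // Combine two partial buffers (for example, from different partitions).
  def merge(a: AvgBuffer, b: AvgBuffer): AvgBuffer =
    AvgBuffer(a.sum + b.sum, a.count + b.count)

  // Turn the final buffer into the result; NaN when no rows were seen.
  def finish(b: AvgBuffer): Double =
    if (b.count == 0) Double.NaN else b.sum / b.count

  def bufferEncoder: Encoder[AvgBuffer] = Encoders.product[AvgBuffer]
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

object Demo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("aggregator-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data for illustration only.
    val df = Seq(("a", 1.0), ("a", 3.0), ("b", 10.0)).toDF("key", "value")

    // Spark 3.0+: wrap the typed Aggregator so it can be used on DataFrame columns.
    val mean = udaf(MeanAggregator)

    df.rollup(col("key"))
      .agg(mean(col("value")).as("mean_value"))
      .show()

    spark.stop()
  }
}
```

The buffer here is a plain case class that Spark encodes once through bufferEncoder, which is the design difference from UserDefinedAggregateFunction that the answer points to. Actual speedups depend on the workload, so it is worth benchmarking on your own data.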