Performance comparison of UDAF and Aggregator in Spark


Question

I am trying to write performance-conscious code in Spark and am wondering whether I should write an Aggregator or a User-Defined Aggregate Function (UDAF) for my rollup operations on a DataFrame.

I have not been able to find any data on how fast each of these approaches is, or on which one should be used with Spark 2.0+.

Recommended answer

You should write an Aggregator rather than a UserDefinedAggregateFunction, because UserDefinedAggregateFunction performs inefficient serialization and deserialization work for every row. Rewriting a UserDefinedAggregateFunction as an Aggregator can yield speedups ranging from 25-30% up to 100x, as stated in the pull request replacing UserDefinedAggregateFunction with Aggregator.
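To make the recommendation concrete, here is a minimal sketch of a typed Aggregator that averages a column per group. The `Sale` and `AvgBuffer` case classes, the `AvgAmount` object, and the sample data are hypothetical names made up for illustration; only `Aggregator`, `Encoders`, `groupByKey`, and `toColumn` come from Spark itself. It assumes Spark 2.0+ on the classpath and a local master.

```scala
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
import org.apache.spark.sql.expressions.Aggregator

// Hypothetical input row and intermediate buffer types, for illustration only.
case class Sale(region: String, amount: Double)
case class AvgBuffer(sum: Double, count: Long)

// Typed average of Sale.amount: IN = Sale, BUF = AvgBuffer, OUT = Double.
object AvgAmount extends Aggregator[Sale, AvgBuffer, Double] {
  // Empty buffer used as the starting point of every partial aggregation.
  def zero: AvgBuffer = AvgBuffer(0.0, 0L)
  // Fold one input row into the buffer.
  def reduce(b: AvgBuffer, s: Sale): AvgBuffer = AvgBuffer(b.sum + s.amount, b.count + 1)
  // Combine two partial buffers, e.g. from different partitions.
  def merge(a: AvgBuffer, b: AvgBuffer): AvgBuffer = AvgBuffer(a.sum + b.sum, a.count + b.count)
  // Turn the final buffer into the result; NaN for an empty group.
  def finish(b: AvgBuffer): Double = if (b.count == 0) Double.NaN else b.sum / b.count
  def bufferEncoder: Encoder[AvgBuffer] = Encoders.product[AvgBuffer]
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

object AggregatorExample {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session is good enough for this sketch.
    val spark = SparkSession.builder().master("local[*]").appName("aggregator-sketch").getOrCreate()
    import spark.implicits._

    val sales = Seq(Sale("east", 10.0), Sale("east", 30.0), Sale("west", 5.0)).toDS()

    // toColumn turns the Aggregator into a TypedColumn usable in a typed agg.
    sales.groupByKey(_.region).agg(AvgAmount.toColumn.name("avg_amount")).show()

    spark.stop()
  }
}
```

This form works on a typed Dataset, so a DataFrame would first need to be converted with `.as[Sale]` (or an equivalent case class matching its schema).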

Because of these performance issues, the UserDefinedAggregateFunction class has been deprecated as of Spark 3.0.
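For untyped DataFrame rollups on Spark 3.0+, an Aggregator can be wrapped with `functions.udaf`, which as far as I know is the replacement the deprecation notice points to. The sketch below assumes Spark 3.0 or later; `LongSum`, the column names, and the `long_sum` SQL name are hypothetical and chosen only for illustration.

```scala
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.functions.{col, udaf}

// Hypothetical aggregator for illustration: sums a Long column.
object LongSum extends Aggregator[Long, Long, Long] {
  def zero: Long = 0L                               // empty partial sum
  def reduce(acc: Long, x: Long): Long = acc + x    // add one input value
  def merge(a: Long, b: Long): Long = a + b         // combine partial sums
  def finish(acc: Long): Long = acc                 // the buffer is already the result
  def bufferEncoder: Encoder[Long] = Encoders.scalaLong
  def outputEncoder: Encoder[Long] = Encoders.scalaLong
}

object UdafRegistrationExample {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session is good enough for this sketch.
    val spark = SparkSession.builder().master("local[*]").appName("udaf-sketch").getOrCreate()
    import spark.implicits._

    val df = Seq(("east", 10L), ("east", 30L), ("west", 5L)).toDF("region", "units")

    // Wrap the Aggregator so it can be applied to untyped columns.
    val longSum = udaf(LongSum)
    df.groupBy("region").agg(longSum(col("units")).as("total_units")).show()

    // Optionally register it under a name for use from SQL.
    spark.udf.register("long_sum", longSum)
    df.createOrReplaceTempView("sales")
    spark.sql("SELECT region, long_sum(units) AS total_units FROM sales GROUP BY region").show()

    spark.stop()
  }
}
```

A built-in `sum` would of course be the better choice for this particular aggregation; the point is only to show the shape of the migration away from UserDefinedAggregateFunction.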

