Question
In Spark's documentation, Aggregator is:
abstract class Aggregator[-IN, BUF, OUT] extends Serializable
A base class for user-defined aggregations, which can be used in Dataset operations to take all of the elements of a group and reduce them to a single value.
UserDefinedAggregateFunction is:
abstract class UserDefinedAggregateFunction extends Serializable
The base class for implementing user-defined aggregate functions (UDAF).
According to Dataset Aggregator - Databricks, "an Aggregator is similar to a UDAF, but the interface is expressed in terms of JVM objects instead of as a Row."
These two classes seem very similar. What are the differences between them, apart from the types in the interface?
A similar question is: Performance of UDAF versus Aggregator in Spark
Recommended answer
A fundamental difference, apart from types, is the external interface:
Aggregator takes a complete Row, that is, the whole record (it is intended for the "strongly" typed API).
UserDefinedAggregateFunction takes a set of Columns.
This makes Aggregator less flexible, although the overall API is far more user friendly.
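The interface difference described above can be sketched as follows. This is a minimal illustration, not code from the original answer: the Sale case class, the TypedSum and UntypedSum objects, and the sample data are hypothetical names made up for this example, and it assumes a Spark 2.x or 3.x build on the classpath (in Spark 3.x, UserDefinedAggregateFunction still compiles but is marked deprecated).

```scala
import org.apache.spark.sql.{Encoder, Encoders, Row, SparkSession}
import org.apache.spark.sql.expressions.{Aggregator, MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DataType, LongType, StructField, StructType}

// Hypothetical record type, used only for this illustration.
case class Sale(shop: String, amount: Long)

// Typed API: the Aggregator receives the whole Sale object for each record.
object TypedSum extends Aggregator[Sale, Long, Long] {
  def zero: Long = 0L
  def reduce(buffer: Long, sale: Sale): Long = buffer + sale.amount
  def merge(b1: Long, b2: Long): Long = b1 + b2
  def finish(reduction: Long): Long = reduction
  def bufferEncoder: Encoder[Long] = Encoders.scalaLong
  def outputEncoder: Encoder[Long] = Encoders.scalaLong
}

// Untyped API: the UDAF declares schemas and is applied to chosen columns.
object UntypedSum extends UserDefinedAggregateFunction {
  def inputSchema: StructType = StructType(StructField("amount", LongType) :: Nil)
  def bufferSchema: StructType = StructType(StructField("sum", LongType) :: Nil)
  def dataType: DataType = LongType
  def deterministic: Boolean = true
  def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = 0L
  }
  def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    // Skip null inputs so they do not affect the running sum.
    if (!input.isNullAt(0)) buffer(0) = buffer.getLong(0) + input.getLong(0)
  }
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = buffer1.getLong(0) + buffer2.getLong(0)
  }
  def evaluate(buffer: Row): Any = buffer.getLong(0)
}

object Demo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("aggregator-vs-udaf")
      .getOrCreate()
    import spark.implicits._

    val sales = Seq(Sale("a", 1L), Sale("a", 2L), Sale("b", 5L)).toDS()

    // Aggregator: used on the typed Dataset; reduce sees a full Sale.
    sales.groupByKey(_.shop).agg(TypedSum.toColumn).show()

    // UDAF: called with an explicit set of Columns on an untyped DataFrame.
    sales.toDF().groupBy(col("shop")).agg(UntypedSum(col("amount"))).show()

    spark.stop()
  }
}
```

In this sketch the UDAF can be pointed at any column with a matching type by changing the argument to UntypedSum(...), while TypedSum is tied to the Sale type. That is one way to read the flexibility trade-off mentioned above.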
There is also a difference in how state is handled:
Aggregator is stateful. It depends on the mutable internal state of its buffer field.
UserDefinedAggregateFunction is stateless. The state of the buffer is external.