Should I avoid using groupBy() on Dataset/DataFrame?

Problem Description

I know that with RDDs we were discouraged from using groupByKey and encouraged to use alternatives such as reduceByKey() and aggregateByKey(), since those methods reduce first on each partition and then merge the partial results after the shuffle, which reduces the amount of data being shuffled.
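For illustration, here is a minimal sketch of the two RDD approaches side by side (assuming a local SparkSession; the sample pairs are made up):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("group-vs-reduce")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1), ("b", 1)))

    // groupByKey ships every (key, value) pair across the network and
    // only sums the values after the shuffle.
    val viaGroup = pairs.groupByKey().mapValues(_.sum)

    // reduceByKey sums within each partition first (map-side combine),
    // so only one partial sum per key per partition is shuffled.
    val viaReduce = pairs.reduceByKey(_ + _)

    viaGroup.collect().foreach(println)
    viaReduce.collect().foreach(println)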

Now, my question is whether this still applies to Dataset/DataFrame. I was thinking that since the Catalyst engine does a lot of optimization, it will automatically know that it should reduce on each partition and then perform the groupBy. Am I correct, or do we still need to take steps to ensure that the reduction on each partition is performed before the groupBy?

Recommended Answer

groupBy should be used with DataFrames and Datasets. Your thinking is completely right: the Catalyst optimizer will build the plan and optimize the groupBy, along with any other aggregations you want to perform.
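One way to see this is to inspect the physical plan. In the sketch below (the column names are made up), explain() typically prints two HashAggregate nodes: a partial one that runs before the shuffle (the Exchange) and a final one after it, meaning the per-partition reduction happens automatically:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.sum

    val spark = SparkSession.builder()
      .appName("groupby-plan")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(("a", 1), ("b", 1), ("a", 1)).toDF("key", "value")
    val agg = df.groupBy("key").agg(sum("value").as("total"))

    // Expected plan shape:
    // HashAggregate(partial_sum) -> Exchange -> HashAggregate(sum)
    agg.explain()
    agg.show()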

There is a good example of this for Spark 1.4 at this link, which shows a comparison of reduceByKey with an RDD against groupBy with a DataFrame.

You can see there that the DataFrame version is really much faster than the RDD one, because groupBy optimizes the whole execution. For more details, see the official Databricks post introducing DataFrames.
