Should I avoid using groupBy() on Dataset/DataFrame?

Problem Description

I know that with RDDs we were discouraged from using groupByKey and encouraged to use alternatives such as reduceByKey() and aggregateByKey(), since those methods reduce first on each partition and then merge the partial results after the shuffle, which reduces the amount of data being shuffled.
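For illustration, here is a minimal sketch of the two RDD approaches side by side (assuming a local SparkSession; the sample pairs are made up):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("group-vs-reduce")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1), ("b", 1)))

    // groupByKey ships every (key, value) pair across the network and
    // only sums the values after the shuffle.
    val viaGroup = pairs.groupByKey().mapValues(_.sum)

    // reduceByKey sums within each partition first (map-side combine),
    // so only one partial sum per key per partition is shuffled.
    val viaReduce = pairs.reduceByKey(_ + _)

    viaGroup.collect().foreach(println)
    viaReduce.collect().foreach(println)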

Now, my question is whether this still applies to Dataset/DataFrame. I was thinking that since the Catalyst engine does a lot of optimization, it will automatically know that it should reduce on each partition and then perform the groupBy. Am I correct, or do we still need to take steps to ensure that the reduction on each partition is performed before the groupBy?

Recommended Answer

groupBy should be used with DataFrames and Datasets. Your thinking is completely right: the Catalyst optimizer will build the plan and optimize the groupBy, along with any other aggregations you want to perform.
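One way to see this is to inspect the physical plan. In the sketch below (the column names are made up), explain() typically prints two HashAggregate nodes: a partial one that runs before the shuffle (the Exchange) and a final one after it, meaning the per-partition reduction happens automatically:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.sum

    val spark = SparkSession.builder()
      .appName("groupby-plan")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(("a", 1), ("b", 1), ("a", 1)).toDF("key", "value")
    val agg = df.groupBy("key").agg(sum("value").as("total"))

    // Expected plan shape:
    // HashAggregate(partial_sum) -> Exchange -> HashAggregate(sum)
    agg.explain()
    agg.show()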

There is a good example of this for Spark 1.4 at this link, which shows a comparison of reduceByKey with an RDD against groupBy with a DataFrame.

You can see there that the DataFrame version is really much faster than the RDD one, because groupBy optimizes the whole execution. For more details, see the official Databricks post introducing DataFrames.
