Problem Description
I am trying to optimise my Spark application job.
I tried to understand the points from this question: How to avoid shuffles while joining DataFrames on unique keys?
I have made sure that the keys on which the join operation has to happen are distributed within the same partition (using my custom partitioner).
I also cannot do a broadcast join because my data may become large depending on the situation.
In the answer to the above-mentioned question, repartitioning only optimises the join, but what I need is a join WITHOUT A SHUFFLE. I am fine with the join operation being performed with the help of the keys within each partition.
Is it possible? I want to implement something like joinPerPartition if similar functionality does not exist.
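The joinPerPartition idea can be illustrated outside Spark: if both datasets are partitioned with the same hash function, partition i of one side can only ever match partition i of the other, so each partition pair can be joined independently with no cross-partition data movement. A minimal plain-Python sketch (the function names here are hypothetical, not a Spark API):

```python
from collections import defaultdict

def partition_by_key(pairs, num_partitions):
    """Hash-partition (key, value) pairs, mimicking a custom partitioner."""
    parts = [[] for _ in range(num_partitions)]
    for k, v in pairs:
        parts[hash(k) % num_partitions].append((k, v))
    return parts

def join_per_partition(parts_a, parts_b):
    """Join partition i of A only against partition i of B.

    Because both sides used the same partitioner, no record ever needs
    to be compared across partitions -- the equivalent of a map-side join.
    """
    out = []
    for pa, pb in zip(parts_a, parts_b):
        index = defaultdict(list)          # build a small hash index per partition
        for k, v in pb:
            index[k].append(v)
        for k, v in pa:
            for w in index[k]:
                out.append((k, (v, w)))
    return out

a = partition_by_key([("x", 1), ("y", 2)], 4)
b = partition_by_key([("x", 10), ("z", 30)], 4)
print(join_per_partition(a, b))  # [('x', (1, 10))]
```

This only works because both inputs went through the *same* partitioning function with the *same* partition count, which is exactly the precondition the answer below relies on.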
Recommended Answer
This is not true. Repartitioning does not only "optimize" the join. Repartitioning binds a Partitioner to your RDD, which is the key component for a map-side join.
Spark must know about this. Build your DataFrames with the appropriate APIs so that they have the same Partitioner, and Spark will take care of the rest.