Problem Description
I am trying to optimise my Spark application job.
I tried to understand the points from this question: How to avoid shuffles while joining DataFrames on unique keys?
I have made sure that the keys on which the join operation has to happen are distributed within the same partition (using my custom partitioner).
I also cannot do a broadcast join because my data may become large depending on the situation.
In the answer to the above-mentioned question, repartitioning only optimises the join, but what I need is a join WITHOUT A SHUFFLE. I am fine with the join operation being performed with the help of the keys within each partition.
Is it possible? I want to implement something like joinPerPartition if similar functionality does not exist.
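The joinPerPartition idea can be illustrated outside Spark: if both datasets are partitioned with the same hash function, partition i of one side can only ever match partition i of the other, so each partition pair can be joined independently with no cross-partition data movement. A minimal plain-Python sketch (the function names here are hypothetical, not a Spark API):

```python
from collections import defaultdict

def partition_by_key(pairs, num_partitions):
    """Hash-partition (key, value) pairs, mimicking a custom partitioner."""
    parts = [[] for _ in range(num_partitions)]
    for k, v in pairs:
        parts[hash(k) % num_partitions].append((k, v))
    return parts

def join_per_partition(parts_a, parts_b):
    """Join partition i of A only against partition i of B.

    Because both sides used the same partitioner, no record ever needs
    to be compared across partitions -- the equivalent of a map-side join.
    """
    out = []
    for pa, pb in zip(parts_a, parts_b):
        index = defaultdict(list)          # build a small hash index per partition
        for k, v in pb:
            index[k].append(v)
        for k, v in pa:
            for w in index[k]:
                out.append((k, (v, w)))
    return out

a = partition_by_key([("x", 1), ("y", 2)], 4)
b = partition_by_key([("x", 10), ("z", 30)], 4)
print(join_per_partition(a, b))  # [('x', (1, 10))]
```

This only works because both inputs went through the *same* partitioning function with the *same* partition count, which is exactly the precondition the answer below relies on.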
Recommended Answer
This is not true. Repartitioning does not only "optimize" the join. Repartitioning binds a Partitioner to your RDD, which is the key component for a map-side join.
Spark must know about this. Build your DataFrames with the appropriate APIs so that they have the same Partitioner, and Spark will take care of the rest.