Problem description
When doing a regular join in Pig, the last table in the join is not brought into memory but streamed through instead. So if A has small cardinality per key and B has large cardinality, it is significantly better for performance to do join A, B than join B, A, since this avoids spills and OOM errors.
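To make the buffering argument concrete, here is a toy model in plain Python of what happens for a single join key when every input except the last is buffered and the last one is streamed. This is only a sketch of the idea described above, not Pig's actual code; the function name join_one_key and the sample data are hypothetical.

```python
from typing import Iterable, Iterator, Tuple


def join_one_key(key: str,
                 buffered_values: Iterable[str],
                 streamed_values: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Join the values of two inputs that share one key.

    The first input is materialized in memory; the last input is consumed
    one record at a time and never held as a whole.
    """
    # Memory cost is proportional to the per-key cardinality of this input.
    buffered = list(buffered_values)

    # The last input is only iterated, so its size does not affect memory.
    for right in streamed_values:
        for left in buffered:
            yield key, left, right


# Hypothetical data: A has one row for the key, B has many.
a_values = ["a1"]
b_values = (f"b{i}" for i in range(5))  # a generator, never fully in memory

rows = list(join_one_key("k1", a_values, b_values))
```

With A buffered, memory holds one value for this key; swapping the arguments would buffer all of B's values instead, which is where the spill and OOM risk comes from.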
Is there a similar concept in Spark? I have not seen any such recommendation and wonder how that is possible, since the implementation looks much the same as Pig's: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/CoGroupedRDD.scala
Or am I missing something?
Recommended answer
It does not make a difference. In Spark, an RDD is only brought into memory if it is cached, so to achieve the same effect you can cache the smaller RDD. Another thing you can do in Spark, which I am not sure Pig can do: if all the RDDs being joined have the same partitioner, no shuffle is needed.
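The point about a shared partitioner can be illustrated with a small plain-Python simulation. It does not call Spark; it only models why two datasets split by the same partitioning function can be joined partition by partition without moving records. The names partition, join_copartitioned, and NUM_PARTITIONS are hypothetical, and the hash-modulo scheme is an assumption standing in for whatever partitioner the RDDs share.

```python
from typing import Dict, List, Tuple

Record = Tuple[int, str]
NUM_PARTITIONS = 4  # hypothetical partition count


def partition(records: List[Record],
              num_partitions: int = NUM_PARTITIONS) -> List[List[Record]]:
    """Split records by key hash; stands in for a partitioner shared by both sides."""
    parts: List[List[Record]] = [[] for _ in range(num_partitions)]
    for key, value in records:
        parts[hash(key) % num_partitions].append((key, value))
    return parts


def join_copartitioned(left_parts: List[List[Record]],
                       right_parts: List[List[Record]]):
    """Join partition i of the left side with partition i of the right side only."""
    assert len(left_parts) == len(right_parts)
    out = []
    for left, right in zip(left_parts, right_parts):
        # Equal keys always land in the same partition index on both sides,
        # so no record has to be looked up in, or moved to, another partition.
        index: Dict[int, List[str]] = {}
        for key, value in left:
            index.setdefault(key, []).append(value)
        for key, other in right:
            for value in index.get(key, ()):
                out.append((key, (value, other)))
    return out


small = [(1, "a"), (2, "b")]
large = [(1, "x"), (1, "y"), (3, "z")]

joined = join_copartitioned(partition(small), partition(large))
```

In Spark itself the corresponding step would be to apply the same partitioner to both pair RDDs (for example with partitionBy) before joining, and to cache the partitioned RDDs if they are reused.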