Problem description
I have:
A_RDD = anRDD.map()
B_RDD = A_RDD.aggregateByKey()
OK, my question is:
If I put partitionBy(new HashPartitioner) after A_RDD, like this:
A_RDD = anRDD.map().partitionBy(new HashPartitioner(2))
B_RDD = A_RDD.aggregateByKey()
1) Will this be as efficient as leaving it as it was in the first place? aggregateByKey() will use the HashPartitioner from A_RDD, right?
2) Or, if I leave it as in the first example, will aggregateByKey() aggregate every partition by key first, and then send each "aggregated" (key, value) pair to the right partition in a more efficient way?
3) Why can't map, flatMap, and other transformations on RDDs take an argument that says how to partition the (key, value) pairs on the fly? What I mean is that, for example, during the map() operation each tuple would also be sent to a specific partition designated by a partitioner argument on map, e.g. map( , Partitioner).
I am trying to grasp how aggregateByKey() works, but every time I think I've got it, a new question arises... Thanks in advance.
Recommended answer
- If you put partitionBy before aggregateByKey, it will typically be less efficient than aggregateByKey alone. You effectively disable map-side combine, because partitionBy shuffles every raw record before any aggregation happens.
- If you leave it as it is, there will be a map-side combine, and that is typically more efficient.
- Non-shuffling operations don't take a partitioner because there is no data movement. The operations are performed locally on each machine.
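The three points above can be checked with a small local experiment. This is only a sketch, assuming Spark is on the classpath and a local master is acceptable; the sample data, the object name, and the partition counts are made up for illustration and are not from the question. It builds both variants, prints their lineage so the shuffle stages can be compared, and shows which transformations keep a partitioner.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

// Hypothetical demo object; names and data are illustrative assumptions.
object AggregateByKeyPartitioning {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("aggregateByKey-partitioning")
      .setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Made-up sample data: 1000 records spread over only 3 distinct keys.
      val pairs = sc.parallelize(1 to 1000, 4).map(i => (i % 3, 1))

      // Variant 1: aggregateByKey alone. Values are combined per key inside
      // each input partition first, so only partial sums cross the shuffle.
      val direct = pairs.aggregateByKey(0)(_ + _, _ + _)

      // Variant 2: partitionBy first. Every raw record is shuffled by
      // partitionBy; the aggregation only starts after that shuffle.
      val prePartitioned = pairs.partitionBy(new HashPartitioner(2))
      val viaPartitionBy = prePartitioned.aggregateByKey(0)(_ + _, _ + _)

      // Compare the lineage (and shuffle boundaries) of the two variants.
      println(direct.toDebugString)
      println(viaPartitionBy.toDebugString)

      // Whether aggregateByKey reused the partitioner set by partitionBy.
      println(viaPartitionBy.partitioner == prePartitioned.partitioner)

      // map may change keys, so its result carries no partitioner;
      // mapValues cannot change keys, so the partitioner is kept.
      println(prePartitioned.map { case (k, v) => (k, v + 1) }.partitioner)
      println(prePartitioned.mapValues(_ + 1).partitioner)

      // Both variants compute the same per-key totals.
      assert(direct.collect().toMap == viaPartitionBy.collect().toMap)
    } finally sc.stop()
  }
}
```

Both variants produce the same result; the difference is in how many records cross the network. With few distinct keys and many records, the first variant shuffles roughly one partial sum per key per input partition, while the second shuffles every record.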