This article covers PySpark: repartition vs partitionBy, and should be a useful reference if you are working through the same question.

Problem Description

I'm working through these two concepts right now and would like some clarity. From working through the command line, I've been trying to identify the differences and when a developer would use repartition vs partitionBy.

Here is some sample code:

rdd = sc.parallelize([('a', 1), ('a', 2), ('b', 1), ('b', 3), ('c', 1), ('ef', 5)])
rdd1 = rdd.repartition(4)   # reshuffle into 4 partitions without looking at the keys
rdd2 = rdd.partitionBy(4)   # hash each key into one of 4 partitions

rdd1.glom().collect()
[[('b', 1), ('ef', 5)], [], [], [('a', 1), ('a', 2), ('b', 3), ('c', 1)]]

rdd2.glom().collect()
[[('a', 1), ('a', 2)], [], [('c', 1)], [('b', 1), ('b', 3), ('ef', 5)]]

I took a look at the implementation of both, and the only difference I've noticed for the most part is that partitionBy can take a partitioning function, or uses portable_hash by default. So with partitionBy, all identical keys should end up in the same partition. With repartition, I would expect the values to be distributed more evenly over the partitions, but this isn't the case.
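For reference, a minimal sketch of that default routing, reusing the rdd defined above (the name rdd3 and the first-letter rule are just illustrative):

# partitionBy sends each (key, value) pair to bucket partitionFunc(key) % numPartitions;
# the default partitionFunc is pyspark.rdd.portable_hash, so equal keys always
# share a partition.
# A custom partitionFunc overrides that routing; here, a hypothetical rule that
# groups keys by their first letter:
rdd3 = rdd.partitionBy(4, partitionFunc=lambda k: ord(k[0]))
print(rdd3.glom().collect())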

Given this, why would anyone ever use repartition? I suppose the only time I could see it being used is if I'm not working with a PairRDD, or if I have large data skew?

Is there something that I'm missing, or could someone shed light from a different angle for me?

Recommended Answer

repartition already exists in RDDs, and does not handle partitioning by key (or by any criterion other than Ordering). PairRDDs then add the notion of keys, and with it another method, partitionBy, that allows partitioning by that key.
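A small illustration of that distinction, assuming the same SparkContext sc as in the question (the RDD names here are made up for the sketch):

# repartition works on any RDD, keyed or not, and just reshuffles rows for parallelism;
# partitionBy only makes sense on (key, value) records.
plain = sc.parallelize(range(8), 1)            # one partition, no keys
print(plain.repartition(4).glom().collect())   # rows spread over 4 partitions, placement arbitrary
keyed = plain.map(lambda x: (x % 2, x))        # turn it into a PairRDD
print(keyed.partitionBy(2).glom().collect())   # equal keys now share a partition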

So yes, if your data is keyed, you should absolutely partition by that key, which in many cases is the point of using a PairRDD in the first place (for joins, reduceByKey, and so on).
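For example (a hedged sketch with made-up RDD names, assuming the same sc): once both sides of a join share the same partitioner, Spark can keep equal keys co-located instead of reshuffling everything.

left = sc.parallelize([('a', 1), ('a', 2), ('b', 3)]).partitionBy(2)
right = sc.parallelize([('a', 'x'), ('b', 'y')]).partitionBy(2)
print(left.reduceByKey(lambda x, y: x + y).collect())   # per-key aggregation on keyed data
print(left.join(right).collect())                       # join of two co-partitioned PairRDDs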

