This article covers PySpark: repartition vs partitionBy, and should be a useful reference if you are working through the same question.

Problem Description

I'm working through these two concepts right now and would like some clarity. From working through the command line, I've been trying to identify the differences and when a developer would use repartition vs partitionBy.

Here is some sample code:

rdd = sc.parallelize([('a', 1), ('a', 2), ('b', 1), ('b', 3), ('c', 1), ('ef', 5)])
rdd1 = rdd.repartition(4)   # reshuffle into 4 partitions without looking at the keys
rdd2 = rdd.partitionBy(4)   # hash each key into one of 4 partitions

rdd1.glom().collect()
[[('b', 1), ('ef', 5)], [], [], [('a', 1), ('a', 2), ('b', 3), ('c', 1)]]

rdd2.glom().collect()
[[('a', 1), ('a', 2)], [], [('c', 1)], [('b', 1), ('b', 3), ('ef', 5)]]

I took a look at the implementation of both, and the only difference I've noticed for the most part is that partitionBy can take a partitioning function, or uses portable_hash by default. So with partitionBy, all identical keys should end up in the same partition. With repartition, I would expect the values to be distributed more evenly over the partitions, but this isn't the case.
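For reference, a minimal sketch of that default routing, reusing the rdd defined above (the name rdd3 and the first-letter rule are just illustrative):

# partitionBy sends each (key, value) pair to bucket partitionFunc(key) % numPartitions;
# the default partitionFunc is pyspark.rdd.portable_hash, so equal keys always
# share a partition.
# A custom partitionFunc overrides that routing; here, a hypothetical rule that
# groups keys by their first letter:
rdd3 = rdd.partitionBy(4, partitionFunc=lambda k: ord(k[0]))
print(rdd3.glom().collect())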

Given this, why would anyone ever use repartition? I suppose the only time I could see it being used is if I'm not working with a PairRDD, or if I have large data skew?

Is there something that I'm missing, or could someone shed light from a different angle for me?

Recommended Answer

repartition already exists in RDDs, and does not handle partitioning by key (or by any criterion other than Ordering). PairRDDs then add the notion of keys, and with it another method, partitionBy, that allows partitioning by that key.
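A small illustration of that distinction, assuming the same SparkContext sc as in the question (the RDD names here are made up for the sketch):

# repartition works on any RDD, keyed or not, and just reshuffles rows for parallelism;
# partitionBy only makes sense on (key, value) records.
plain = sc.parallelize(range(8), 1)            # one partition, no keys
print(plain.repartition(4).glom().collect())   # rows spread over 4 partitions, placement arbitrary
keyed = plain.map(lambda x: (x % 2, x))        # turn it into a PairRDD
print(keyed.partitionBy(2).glom().collect())   # equal keys now share a partition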

So yes, if your data is keyed, you should absolutely partition by that key, which in many cases is the point of using a PairRDD in the first place (for joins, reduceByKey, and so on).
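For example (a hedged sketch with made-up RDD names, assuming the same sc): once both sides of a join share the same partitioner, Spark can keep equal keys co-located instead of reshuffling everything.

left = sc.parallelize([('a', 1), ('a', 2), ('b', 3)]).partitionBy(2)
right = sc.parallelize([('a', 'x'), ('b', 'y')]).partitionBy(2)
print(left.reduceByKey(lambda x, y: x + y).collect())   # per-key aggregation on keyed data
print(left.join(right).collect())                       # join of two co-partitioned PairRDDs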

