This post covers the difference between Spark SQL's df.repartition and DataFrameWriter partitionBy methods.

Question

What is the difference between DataFrame repartition() and DataFrameWriter partitionBy() methods?

I assume both are used to "partition data based on a dataframe column"? Or is there some difference?

Answer

If you run repartition(COL), you change the partitioning during computation: you will get spark.sql.shuffle.partitions (default: 200) partitions. If you then call .write, you will get one directory with many files.

If you run .write.partitionBy(COL), then as a result you will get as many directories as there are unique values in COL. This speeds up further data reading (if you filter by the partitioning column) and saves some storage space (the partitioning column is removed from the data files).

UPDATE: See @conradlee's answer. He explains in detail not only what the directory structure will look like after applying the different methods, but also what the resulting number of files will be in both scenarios.
