How to control the number of output files written from a Spark DataFrame?

Problem description

I am using Spark Streaming to read JSON data from a Kafka topic.
I use a DataFrame to process the data, and later I want to save the output to HDFS files. The problem is that using:

df.write.mode("append").format("text").save(path)

yields many files: some are large, and some are even 0 bytes.

Is there a way to control the number of output files? Also, to avoid the "opposite" problem, is there a way to limit the size of each file, so that a new file is started when the current one reaches a certain size or number of rows?

Recommended answer

The number of output files is equal to the number of partitions of the Dataset. This means you can control it in a number of ways, depending on the context:

For Datasets with no wide dependencies, you can control the input partitioning using reader-specific parameters.
For Datasets with wide dependencies, you can control the number of partitions with the spark.sql.shuffle.partitions parameter.
Independent of the lineage, you can coalesce or repartition.
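
The last two options can be sketched roughly as follows. This is a minimal illustration, not a definitive recipe: the helper names (partitions_for_target_size, set_shuffle_partitions, write_text_with_n_files) and the idea of deriving the partition count from an estimated data size are assumptions made for this example. Only repartition, coalesce, spark.conf.set and the writer chain are Spark calls.

```python
import math


def partitions_for_target_size(total_bytes: int, target_file_bytes: int) -> int:
    """Hypothetical helper: estimate a partition count for ~target_file_bytes per file."""
    if total_bytes <= 0:
        return 1
    return max(1, math.ceil(total_bytes / target_file_bytes))


def set_shuffle_partitions(spark, num_partitions: int) -> None:
    """Set how many partitions wide (shuffle) operations produce in this session."""
    spark.conf.set("spark.sql.shuffle.partitions", str(num_partitions))


def write_text_with_n_files(df, path: str, num_files: int, shuffle: bool = True) -> None:
    """Append df to path as text, writing at most num_files files for this write."""
    # repartition(n) performs a full shuffle and returns exactly n partitions;
    # coalesce(n) avoids the shuffle but can only lower the partition count.
    resized = df.repartition(num_files) if shuffle else df.coalesce(num_files)
    resized.write.mode("append").format("text").save(path)
```

For example, write_text_with_n_files(df, path, partitions_for_target_size(estimated_bytes, 128 * 1024 * 1024)) would aim for files of roughly 128 MB, assuming estimated_bytes is a reasonable guess at the output size. In a streaming job this applies to each batch separately, so small batches still produce small files.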

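Is there a way to limit the size of each file, so that a new file is started when the current one reaches a certain size or number of rows?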

No. With the built-in writers the relationship is strictly 1:1: each partition is written to one file.
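
The 1:1 rule describes the writers' default behavior. To the best of my knowledge, Spark 2.2 and later also accept a maxRecordsPerFile write option (session-wide: spark.sql.files.maxRecordsPerFile), which caps the number of rows per file. It does not cap the size in bytes. A minimal sketch, assuming such a Spark version; the function name write_text_capped is made up for this example:

```python
def write_text_capped(df, path: str, max_records_per_file: int) -> None:
    """Append df to path as text, starting a new file after max_records_per_file rows."""
    # Assumption: Spark 2.2+; older versions do not apply this option.
    # The cap applies per partition, so a partition holding more rows than
    # the cap is split across several files instead of exactly one.
    (
        df.write
        .option("maxRecordsPerFile", max_records_per_file)
        .mode("append")
        .format("text")
        .save(path)
    )
```

This only sets an upper bound on rows per file. Avoiding many tiny or empty files still depends on reducing the partition count with coalesce or repartition before the write.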
