Problem Description
Say I have a Spark DataFrame that I want to save as a CSV file. Since Spark 2.0.0, the DataFrameWriter class directly supports saving it as a CSV file.
The default behavior is to save the output in multiple part-*.csv files inside the provided path.
How would I save a DF with:
- the path mapping to the exact filename instead of a folder,
- the header available in the first line,
- the output saved as a single file instead of multiple files?
One way to deal with this is to coalesce the DF and then save the file.
df.coalesce(1).write.option("header", "true").csv("sample_file.csv")
However, this has the disadvantage of funneling the entire dataset through a single task on one machine, which therefore needs enough resources to handle it.
Is it possible to write a single CSV file without using coalesce? If not, is there a more efficient way than the code above?
Recommended Answer
I just solved this myself using pyspark with dbutils to get the .csv and rename it to the desired filename.
# `year` is assumed to be defined earlier.
save_location = "s3a://landing-bucket-test/export/" + year
csv_location = save_location + "temp.folder"
file_location = save_location + "export.csv"

# Write a single part file into a temporary folder, copy it to the
# exact target filename, then remove the temporary folder.
df.repartition(1).write.csv(path=csv_location, mode="append", header="true")
file = dbutils.fs.ls(csv_location)[-1].path
dbutils.fs.cp(file, file_location)
dbutils.fs.rm(csv_location, recurse=True)
This answer can be improved by not using [-1], although the .csv part file seems to always be last in the folder. It is a simple and fast solution if you only work with smaller files and can use repartition(1) or coalesce(1).
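As a small hardening of the snippet above, the part file can be selected by name instead of by list position. A minimal sketch, assuming the same dbutils file utilities and variables as in the answer:

# Pick the part file by its name rather than assuming it is listed last.
part_files = [f for f in dbutils.fs.ls(csv_location)
              if f.name.startswith("part-") and f.name.endswith(".csv")]
assert len(part_files) == 1, "expected exactly one part file after repartition(1)"
dbutils.fs.cp(part_files[0].path, file_location)
dbutils.fs.rm(csv_location, recurse=True)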
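As for the original question of avoiding coalesce or repartition entirely: a common alternative is to let Spark write multiple part files in parallel and merge them afterwards through the Hadoop FileSystem API. A minimal sketch, assuming Hadoop 2.x (FileUtil.copyMerge was removed in Hadoop 3), an active SparkSession named spark, and hypothetical paths; headers are turned off because copyMerge would otherwise repeat the header row once per part file:

# Write part files in parallel (no single-task bottleneck), then merge them
# into one CSV with Hadoop's FileUtil.copyMerge (available in Hadoop 2.x).
tmp_dir = "/tmp/export_parts"   # hypothetical temporary folder
target = "/tmp/export.csv"      # hypothetical final single-file path

df.write.csv(tmp_dir, header=False)  # no headers: copyMerge would duplicate them

hadoop = spark.sparkContext._jvm.org.apache.hadoop
conf = spark.sparkContext._jsc.hadoopConfiguration()
fs = hadoop.fs.FileSystem.get(conf)
hadoop.fs.FileUtil.copyMerge(
    fs, hadoop.fs.Path(tmp_dir),   # source folder with part-*.csv files
    fs, hadoop.fs.Path(target),    # destination single file
    True,                          # delete the source folder after merging
    conf, None)

The merge streams the part files through the Hadoop client, so nothing has to hold the whole dataset in memory at once.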