Problem Description
I'm trying to save a DataFrame to CSV using the new Spark 2.1 CSV option:
df.select(myColumns: _*).write
.mode(SaveMode.Overwrite)
.option("header", "true")
.option("codec", "org.apache.hadoop.io.compress.GzipCodec")
.csv(absolutePath)
Everything works fine, and I don't mind having the part-000XX prefix, but now it seems a UUID has been added as a suffix,
i.e.:
part-00032-10309cf5-a373-4233-8b28-9e10ed279d2b.csv.gz ==> part-00032.csv.gz
Does anyone know how I can remove this suffix and keep only the part-000XX convention?
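The rename being asked for amounts to stripping the UUID segment out of each part-file name. A minimal sketch of that mapping (stripUuid is a hypothetical helper, not part of Spark):

```scala
// Hypothetical helper: strip the UUID segment Spark inserts into part-file
// names, e.g. "part-00032-<uuid>.csv.gz" -> "part-00032.csv.gz".
object PartFileNames {
  private val UuidSuffix =
    "^(part-\\d+)-[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}(\\..*)$".r

  def stripUuid(name: String): String = name match {
    case UuidSuffix(prefix, ext) => prefix + ext
    case other                   => other // leave non-matching names untouched
  }
}
```

Applied to the files written under absolutePath (for example via a Hadoop FileSystem rename pass after the write), this would restore the plain part-000XX naming.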
Thanks
Recommended Answer
You can remove the UUID by overriding the configuration option "spark.sql.sources.writeJobUUID".
Unfortunately this solution will not fully mirror the old saveAsTextFile style (i.e. part-00000), but it can make the output file name more sensible, e.g. part-00000-output.csv.gz, where "output" is the value you pass to spark.sql.sources.writeJobUUID. The "-" is appended automatically.
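A sketch of that override, assuming (as the answer describes) that the option is read back from the Hadoop job configuration; whether a user-set value survives on your Spark build should be verified, and "output" is just an illustrative value:

```scala
// Sketch only: assumes "spark.sql.sources.writeJobUUID" set in the Hadoop
// configuration is honored by the CSV writer, per the linked pull request.
spark.sparkContext.hadoopConfiguration.set("spark.sql.sources.writeJobUUID", "output")

df.select(myColumns: _*).write
  .mode(SaveMode.Overwrite)
  .option("header", "true")
  .option("codec", "org.apache.hadoop.io.compress.GzipCodec")
  .csv(absolutePath)
// files would then be named like part-00032-output.csv.gz instead of
// part-00032-<random uuid>.csv.gz
```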
SPARK-8406 is the relevant Spark issue, and here is the actual pull request: https://github.com/apache/spark/pull/6864