Problem Description

Currently, when I use partitionBy to write to HDFS: DF.write.partitionBy("id")
I get an output structure that looks like this (the default behaviour):
../id=1/
../id=2/
../id=3/
I would like a structure that looks like:
../a/
../b/
../c/
such that:

if id = 1, then a
if id = 2, then b
...and so on
Is there a way to change the filename output? If not, what is the best way to do this?
Recommended Answer
You won't be able to use Spark's partitionBy to achieve this.
Instead, you have to break your DataFrame into its component partitions and save them one by one, like so:
# Map id=1 -> 'a', id=2 -> 'b', ... by offsetting from the code point of 'a'
base = ord('a') - 1
for id in range(1, 4):
    # Select only the rows for this id and write them to their own directory
    DF.filter(DF['id'] == id).write.save("..." + chr(base + id))
Alternatively, you can write the entire dataframe using Spark's partitionBy facility, and then manually rename the partitions using HDFS APIs.
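If you take the rename route, the id-to-letter mapping is the same ord/chr arithmetic as above, and the actual move can be done with the hdfs dfs -mv command (or the Hadoop FileSystem API). A minimal sketch, assuming the base path /data/out and the use of the hdfs command-line client are illustrative; adjust them to your cluster:

```python
import subprocess

def partition_dir_to_letter(id_value):
    """Map an integer partition value to a letter: 1 -> 'a', 2 -> 'b', ..."""
    return chr(ord('a') - 1 + id_value)

def rename_partitions(base_path, ids, dry_run=True):
    """Build (and optionally run) hdfs dfs -mv commands that rename
    the id=N directories produced by partitionBy to letter directories.

    With dry_run=True the commands are only returned, not executed.
    """
    commands = []
    for id_value in ids:
        src = "%s/id=%d" % (base_path, id_value)
        dst = "%s/%s" % (base_path, partition_dir_to_letter(id_value))
        cmd = ["hdfs", "dfs", "-mv", src, dst]
        commands.append(cmd)
        if not dry_run:
            # Requires an HDFS client on PATH and permissions on base_path
            subprocess.check_call(cmd)
    return commands
```

Run the function with dry_run=True first to inspect the generated commands before letting it touch the filesystem.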
This concludes the article on Spark: partitionBy, changing the output filename.