Problem Description
I'm trying to find an effective way of saving the result of my Spark job as a CSV file. I'm using Spark with Hadoop, and so far all my files are saved as part-00000.
Any ideas how to make Spark save to a file with a specified file name?
Solution
Since Spark uses the Hadoop File System API to write data to files, this is more or less inevitable. If you do
rdd.saveAsTextFile("foo")
it will be saved as "foo/part-XXXXX", with one part-* file for every partition in the RDD you are trying to save. Each partition in the RDD is written to a separate file for fault tolerance: if the task writing the 3rd partition (i.e. to part-00002) fails, Spark simply re-runs that task and overwrites the partially written/corrupted part-00002, with no effect on the other parts. If they all wrote to the same file, it would be much harder to recover from a single task failure.
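To make the partition-to-file mapping concrete, here is a minimal Scala sketch; the SparkContext sc, the sample data, and the output path foo are my own assumptions, not part of the original answer:

// Three partitions produce three part files under foo/
val rdd = sc.parallelize(Seq("a,1", "b,2", "c,3"), numSlices = 3)
rdd.saveAsTextFile("foo")
// foo/ now contains part-00000, part-00001 and part-00002 (plus a _SUCCESS marker)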
The part-XXXXX files are usually not a problem if you are going to consume the output again in Spark or other Hadoop-based frameworks: since they all use the HDFS API, if you ask them to read "foo", they will read all the part-XXXXX files inside foo as well.
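If you really do need a single file with a chosen name, one common workaround (a hedged sketch of my own, not part of the original answer; sc, rdd, foo_single and result.csv are all assumed names) is to coalesce to one partition and then rename the lone part file with the Hadoop FileSystem API:

import org.apache.hadoop.fs.{FileSystem, Path}

// Reading the output directory picks up every part-XXXXX file automatically
val all = sc.textFile("foo")

// Write a single partition, then rename its one part file to the desired name
rdd.coalesce(1).saveAsTextFile("foo_single")
val fs = FileSystem.get(sc.hadoopConfiguration)
fs.rename(new Path("foo_single/part-00000"), new Path("result.csv"))

Note that coalesce(1) funnels all the data through a single task, so this only makes sense for outputs small enough to be written by one executor.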