python - How to set spark.sql.parquet.output.committer.class in pyspark

I am trying to set spark.sql.parquet.output.committer.class, and nothing I do seems to make the setting take effect.

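I am trying to have many threads write to the same output folder. This should work with org.apache.spark.sql.parquet.DirectParquetOutputCommitter, since it does not use a _temporary folder. I get the following error, which is how I know the setting is not being applied: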

Caused by: java.io.FileNotFoundException: File hdfs://path/to/stuff/_temporary/0/task_201606281757_0048_m_000029/some_dir does not exist.
        at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:795)
        at org.apache.hadoop.hdfs.DistributedFileSystem.access$700(DistributedFileSystem.java:106)
        at org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:853)
        at org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:849)
        at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
        at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:849)
        at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:382)
        at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:384)
        at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:326)
        at org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:46)
        at org.apache.spark.sql.execution.datasources.BaseWriterContainer.commitJob(WriterContainer.scala:230)
        at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:151)

Note the call to org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob, which is the default class.

Based on other SO answers and searches, I have tried the following:

sc._jsc.hadoopConfiguration().set(key, val) (this does work for settings such as parquet.enable.summary-metadata)

dataframe.write.option(key, val).parquet

Adding --conf "spark.hadoop.spark.sql.parquet.output.committer.class=org.apache.spark.sql.parquet.DirectParquetOutputCommitter" to the spark-submit call

Adding --conf "spark.sql.parquet.output.committer.class"=" org.apache.spark.sql.parquet.DirectParquetOutputCommitter" to the spark-submit call
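For reference, the two spark-submit attempts above can be assembled programmatically, which avoids quoting slips and stray whitespace such as the leading space in the last value. This is only a sketch: conf_args and my_job.py are hypothetical names, and it relies on Spark copying properties prefixed with spark.hadoop. into the Hadoop configuration. Whether the committer setting is honored still depends on the Spark version.

```python
import shlex

COMMITTER_KEY = "spark.sql.parquet.output.committer.class"
COMMITTER_CLASS = "org.apache.spark.sql.parquet.DirectParquetOutputCommitter"


def conf_args(settings, hadoop=False):
    """Build spark-submit --conf arguments from a dict of settings.

    Whitespace around keys and values is stripped. With hadoop=True each key
    gets the spark.hadoop. prefix so that Spark forwards it to the Hadoop
    configuration.
    """
    args = []
    for key, value in settings.items():
        key, value = key.strip(), value.strip()
        if hadoop:
            key = "spark.hadoop." + key
        args += ["--conf", f"{key}={value}"]
    return args


# The leading space mimics the value from the last attempt in the question.
args = conf_args({COMMITTER_KEY: " " + COMMITTER_CLASS}, hadoop=True)

# my_job.py is a hypothetical script name used only for illustration.
command = ["spark-submit"] + args + ["my_job.py"]
print(shlex.join(command))
```

Passing the list to subprocess.run(command) instead of a shell string sidesteps shell quoting entirely.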

That is everything I have been able to find, and none of it works. It looks like this is not hard to set in Scala, but it appears to be impossible in Python.
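The first attempt in the list can be wrapped in a small helper that also reads the value back, which at least confirms that the Hadoop configuration accepted the key. This is a minimal sketch: set_hadoop_conf is a hypothetical name, sc is assumed to be a live SparkContext, and _jsc is a private PySpark attribute that may change between versions.

```python
COMMITTER_KEY = "spark.sql.parquet.output.committer.class"


def set_hadoop_conf(sc, key, value):
    """Set one Hadoop configuration entry and return the value read back.

    sc is assumed to be a pyspark SparkContext; sc._jsc is its (private)
    handle to the underlying JavaSparkContext.
    """
    hadoop_conf = sc._jsc.hadoopConfiguration()
    hadoop_conf.set(key, value.strip())
    return hadoop_conf.get(key)


# Usage, assuming an existing SparkContext named sc; call it before writing:
#   set_hadoop_conf(sc, COMMITTER_KEY,
#                   "org.apache.spark.sql.parquet.DirectParquetOutputCommitter")
#   dataframe.write.parquet(path)
```

Reading the value back only shows that the key was stored. It does not prove that the Parquet writer used it.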

Best answer

The approach in this comment definitely worked for me:

16/06/28 18:49:59 INFO ParquetRelation: Using user defined output committer for Parquet: org.apache.spark.sql.execution.datasources.parquet.DirectParquetOutputCommitter

It was a log message lost in the flood of output that Spark emits, and the error I was seeing was unrelated. It is all moot anyway, since DirectParquetOutputCommitter has been removed from Spark.
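Since the confirmation is easy to miss, one option is to filter the driver log for that message. The sketch below assumes the log is available as an iterable of text lines and that the message wording matches the line quoted above. find_committers is a hypothetical helper, not part of Spark.

```python
import re

# Pattern built from the log line quoted in the answer above.
COMMITTER_LOG = re.compile(
    r"Using user defined output committer for Parquet: (?P<cls>[\w.$]+)"
)


def find_committers(log_lines):
    """Return the committer class names reported in the given log lines."""
    found = []
    for line in log_lines:
        match = COMMITTER_LOG.search(line)
        if match:
            found.append(match.group("cls"))
    return found


sample = [
    "(placeholder for an unrelated log line)",
    "16/06/28 18:49:59 INFO ParquetRelation: Using user defined output "
    "committer for Parquet: org.apache.spark.sql.execution.datasources."
    "parquet.DirectParquetOutputCommitter",
]
print(find_committers(sample))
```

An empty result suggests the default committer was used, although it can also mean the message was logged elsewhere or at a different log level.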

Source: python - How to set spark.sql.parquet.output.committer.class in pyspark, on Stack Overflow: https://stackoverflow.com/questions/38083563/