Disabling _spark_metadata in Structured Streaming in Spark 2.3.0

This article describes how to deal with disabling _spark_metadata in Structured Streaming in Spark 2.3.0. It should be a useful reference for anyone running into the same problem.

Problem Description

My Structured Streaming application is writing to Parquet, and I want to get rid of the _spark_metadata folder it creates. I used the property below, and at first it seemed fine:

--conf "spark.hadoop.parquet.enable.summary-metadata=false"


When the application starts, no _spark_metadata folder is generated. But once it moves to RUNNING status and starts processing messages, it fails with the error below, saying the _spark_metadata folder doesn't exist. It seems Structured Streaming relies on this folder and cannot run without it. I'm just wondering whether disabling the metadata property makes any sense in this context. Is it a bug that the stream is not honoring the conf?


Caused by: java.io.FileNotFoundException: File /_spark_metadata does not exist.
        at org.apache.hadoop.fs.Hdfs.listStatus(Hdfs.java:261)
        at org.apache.hadoop.fs.FileContext$Util$1.next(FileContext.java:1765)
        at org.apache.hadoop.fs.FileContext$Util$1.next(FileContext.java:1761)
        at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
        at org.apache.hadoop.fs.FileContext$Util.listStatus(FileContext.java:1761)
        at org.apache.hadoop.fs.FileContext$Util.listStatus(FileContext.java:1726)
        at org.apache.hadoop.fs.FileContext$Util.listStatus(FileContext.java:1685)
        at org.apache.spark.sql.execution.streaming.HDFSMetadataLog$FileContextManager.list(HDFSMetadataLog.scala:370)
        at org.apache.spark.sql.execution.streaming.HDFSMetadataLog.getLatest(HDFSMetadataLog.scala:231)
        at org.apache.spark.sql.execution.streaming.FileStreamSink.addBatch(FileStreamSink.scala:99)
        at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3$$anonfun$apply$16.apply(MicroBatchExecution.scala:477)
        at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
        at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3.apply(MicroBatchExecution.scala:475)
        at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
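For context, here is a rough sketch (the paths, broker, and topic are hypothetical) of the kind of query that hits this code path: the Parquet file sink (FileStreamSink) keeps its commit log under <output path>/_spark_metadata and lists it on each micro-batch, which appears to be the listStatus call in the stack trace above.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().appName("streaming-parquet-sink").getOrCreate()

// Hypothetical Kafka source.
val input = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .load()

// Parquet file sink: its commit log lives at /data/out/_spark_metadata.
val query = input
  .selectExpr("CAST(value AS STRING) AS value")
  .writeStream
  .format("parquet")
  .option("path", "/data/out")                        // output directory
  .option("checkpointLocation", "/data/checkpoints")  // query checkpoint directory
  .trigger(Trigger.ProcessingTime("30 seconds"))
  .start()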

Recommended Answer

The reason this was happening is that the Kafka checkpoint folder was not cleaned up. The files inside the Kafka checkpoint directory were cross-referencing the Spark metadata files and failing. Once I removed both, it started working.

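Below is a hedged sketch of one way to do the cleanup the answer describes, with hypothetical paths, using the Hadoop FileSystem API: remove the stale checkpoint directory together with the sink's _spark_metadata log, then restart the query so both are rebuilt from scratch. Keep in mind this discards the query's progress, so on restart it reprocesses data according to its starting offsets.

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("cleanup-sketch").getOrCreate()
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

// Hypothetical paths: the query's checkpoint directory and the
// _spark_metadata commit log under the Parquet output directory.
fs.delete(new Path("/data/checkpoints"), true)
fs.delete(new Path("/data/out/_spark_metadata"), true)

// Restart the streaming query afterwards; both directories are recreated.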

That concludes this article on disabling _spark_metadata in Structured Streaming in Spark 2.3.0. We hope the recommended answer is helpful.
