Problem Description
My Structured Streaming application is writing to parquet and I want to get rid of the _spark_metadata folder it's creating. I used the property below and it seemed fine:
--conf "spark.hadoop.parquet.enable.summary-metadata=false"
When the application starts, no _spark_metadata folder is generated. But once it moves to RUNNING status and starts processing messages, it fails with the error below saying the _spark_metadata folder doesn't exist. It seems Structured Streaming relies on this folder and can't run without it. I'm just wondering whether disabling the metadata property makes any sense in this context. Is it a bug that the stream does not honor the conf?
Caused by: java.io.FileNotFoundException: File /_spark_metadata does not exist.
at org.apache.hadoop.fs.Hdfs.listStatus(Hdfs.java:261)
at org.apache.hadoop.fs.FileContext$Util$1.next(FileContext.java:1765)
at org.apache.hadoop.fs.FileContext$Util$1.next(FileContext.java:1761)
at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
at org.apache.hadoop.fs.FileContext$Util.listStatus(FileContext.java:1761)
at org.apache.hadoop.fs.FileContext$Util.listStatus(FileContext.java:1726)
at org.apache.hadoop.fs.FileContext$Util.listStatus(FileContext.java:1685)
at org.apache.spark.sql.execution.streaming.HDFSMetadataLog$FileContextManager.list(HDFSMetadataLog.scala:370)
at org.apache.spark.sql.execution.streaming.HDFSMetadataLog.getLatest(HDFSMetadataLog.scala:231)
at org.apache.spark.sql.execution.streaming.FileStreamSink.addBatch(FileStreamSink.scala:99)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3$$anonfun$apply$16.apply(MicroBatchExecution.scala:477)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3.apply(MicroBatchExecution.scala:475)
at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
Accepted Answer
The reason this was happening is that the Kafka checkpoint folder was not cleaned up. The files inside the Kafka checkpoint were cross-referencing the Spark metadata files and failing. Once I removed both, it started working.
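One hedged way to do that cleanup, assuming the default HDFS filesystem and the same hypothetical paths as above, is to delete the stale checkpoint directory together with the _spark_metadata folder in the output path before restarting the query. Note that deleting the checkpoint discards the stored Kafka offsets, so the stream will reprocess from its starting offsets; only do this if that is acceptable.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object ResetStreamingState {
  def main(args: Array[String]): Unit = {
    val fs = FileSystem.get(new Configuration())

    // The Kafka offsets and commit log live under the checkpoint location.
    fs.delete(new Path("/data/checkpoints"), true)           // recursive delete, hypothetical path

    // The file sink log lives under <output path>/_spark_metadata.
    fs.delete(new Path("/data/output/_spark_metadata"), true) // hypothetical path
  }
}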