My Structured Streaming application writes Parquet output, and I want to get rid of the _spark_metadata folder it creates. I used the property below, which seemed to work fine:
--conf "spark.hadoop.parquet.enable.summary-metadata=false"

When the application starts, no _spark_metadata folder is generated. But once it reaches the RUNNING state and begins processing messages, it fails with the error below, complaining that the _spark_metadata folder does not exist. It appears that Structured Streaming depends on this folder and cannot run without it. I'm wondering whether disabling the metadata property even makes sense in this scenario, or whether this is a bug where the stream does not pick up the conf.

Caused by: java.io.FileNotFoundException: File /_spark_metadata does not exist.
        at org.apache.hadoop.fs.Hdfs.listStatus(Hdfs.java:261)
        at org.apache.hadoop.fs.FileContext$Util$1.next(FileContext.java:1765)
        at org.apache.hadoop.fs.FileContext$Util$1.next(FileContext.java:1761)
        at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
        at org.apache.hadoop.fs.FileContext$Util.listStatus(FileContext.java:1761)
        at org.apache.hadoop.fs.FileContext$Util.listStatus(FileContext.java:1726)
        at org.apache.hadoop.fs.FileContext$Util.listStatus(FileContext.java:1685)
        at org.apache.spark.sql.execution.streaming.HDFSMetadataLog$FileContextManager.list(HDFSMetadataLog.scala:370)
        at org.apache.spark.sql.execution.streaming.HDFSMetadataLog.getLatest(HDFSMetadataLog.scala:231)
        at org.apache.spark.sql.execution.streaming.FileStreamSink.addBatch(FileStreamSink.scala:99)
        at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3$$anonfun$apply$16.apply(MicroBatchExecution.scala:477)
        at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
        at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3.apply(MicroBatchExecution.scala:475)
        at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)

Best Answer

This happens because the Kafka checkpoint folder was not cleaned up. The files inside the Kafka checkpoint cross-reference the sink's _spark_metadata files, and the lookup fails.
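If that is the cause, one way to recover (a sketch with hypothetical paths; substitute your actual checkpoint and output locations) is to stop the query, then delete the stale checkpoint together with the sink's _spark_metadata folder so that neither side holds dangling references to the other:

```shell
# Hypothetical paths -- replace with your actual locations.
# Stop the streaming query first, then remove the checkpoint
# and the sink's _spark_metadata folder together so they stay in sync:
hdfs dfs -rm -r -f /checkpoints/my-stream
hdfs dfs -rm -r -f /output/parquet/_spark_metadata
```

Note that deleting the checkpoint discards the query's progress: on the next start it will rebuild both folders from scratch and re-read source offsets according to its configured starting position.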

08-25 02:35