After repartitioning a DataFrame in Spark 1.3.0, I get a .parquet exception when saving to Amazon S3.

logsForDate
    .repartition(10)
    .saveAsParquetFile(destination) // <-- Exception here

The exception I get is:
java.io.IOException: The file being written is in an invalid state. Probably caused by an error thrown previously. Current state: COLUMN
at parquet.hadoop.ParquetFileWriter$STATE.error(ParquetFileWriter.java:137)
at parquet.hadoop.ParquetFileWriter$STATE.startBlock(ParquetFileWriter.java:129)
at parquet.hadoop.ParquetFileWriter.startBlock(ParquetFileWriter.java:173)
at parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:152)
at parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:112)
at parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:73)
at org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$writeShard$1(newParquet.scala:635)
at org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:649)
at org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:649)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

I would like to know what the problem is and how to fix it.

Best answer

I can actually reproduce this problem with Spark 1.3.1 on EMR when saving to S3.

However, saving to HDFS works fine. You can save to HDFS first and then use a tool such as s3distcp to move the files to S3.
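
A minimal sketch of that workaround, under several assumptions: the object and method names (SaveViaHdfs, s3DistCpCommand, saveViaHdfs) are hypothetical, the staging path is supplied by the caller, and the tool is assumed to be on the PATH as `s3-dist-cp`. On some EMR releases it ships as a jar that is run through `hadoop jar` instead, so adjust the command to your cluster.

```scala
import org.apache.spark.sql.DataFrame
import scala.sys.process._

// Hypothetical helper, not part of Spark or EMR.
object SaveViaHdfs {

  // Builds the copy command line. The executable name "s3-dist-cp" is an
  // assumption about the cluster; --src and --dest are the S3DistCp options
  // for the source and destination locations.
  def s3DistCpCommand(src: String, dest: String): Seq[String] =
    Seq("s3-dist-cp", "--src", src, "--dest", dest)

  // Writes the DataFrame as Parquet to an HDFS staging directory, then
  // copies that directory to S3. Returns the exit code of the copy command.
  def saveViaHdfs(df: DataFrame, stagingDir: String, s3Dest: String): Int = {
    // Same calls as in the question, but targeting HDFS instead of S3.
    df.repartition(10).saveAsParquetFile(stagingDir)
    // Runs the external command and blocks until it finishes.
    s3DistCpCommand(stagingDir, s3Dest).!
  }
}
```

With the names from the question, a call could look like `SaveViaHdfs.saveViaHdfs(logsForDate, "hdfs:///tmp/logs-staging", destination)`, where the staging path is only an example. The copy leaves the staging directory in HDFS, so it has to be cleaned up separately once the exit code is 0.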

Regarding apache-spark - Parquet error when saving from Spark, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/29960686/
