When I run my Spark job on AWS EMR, I get the error below while trying to read Avro files from an S3 bucket.
It happens on these versions:

  • emr-5.5.0
  • emr-5.9.0

This is the code:
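    // Build one glob pattern per day, going back numOfDaysToFetch days from fromDate
    val files = 0 until numOfDaysToFetch map { i =>
      s"s3n://bravos/clicks/${fromDate.minusDays(i)}/*"
    }
    // Load every matching Avro file into a single DataFrame
    spark.read.format("com.databricks.spark.avro").load(files: _*)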

Exception:
    java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: 1037330823653531755-2017-10-16T03:06:00.avro
        at org.apache.hadoop.fs.Path.initialize(Path.java:205)
        at org.apache.hadoop.fs.Path.<init>(Path.java:171)
        at org.apache.hadoop.fs.Path.<init>(Path.java:93)
        at org.apache.hadoop.fs.Globber.glob(Globber.java:241)
        at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1732)
        at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1713)
        at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.globStatus(EmrFileSystem.java:362)
        at org.apache.spark.deploy.SparkHadoopUtil.globPath(SparkHadoopUtil.scala:237)
        at org.apache.spark.deploy.SparkHadoopUtil.globPathIfNecessary(SparkHadoopUtil.scala:243)
        at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:374)
        at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:370)
        at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
        at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
        at scala.collection.immutable.List.foreach(List.scala:381)
        at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
        at scala.collection.immutable.List.flatMap(List.scala:344)
        at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:370)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
    


Best answer

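Hadoop's Path does not support colons in file names. It treats everything before the first colon, here 1037330823653531755-2017-10-16T03, as a URI scheme, and then complains that what follows does not start with "/". That is why the exception reads "Relative path in absolute URI".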
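
To illustrate the explanation above, here is a minimal sketch that reproduces the same kind of failure with plain `java.net.URI`, without any Hadoop dependency. The `splitLikePath` helper is hypothetical: it is a simplified approximation of how a path string gets split into a scheme and a path, not Hadoop's actual code.

```scala
import java.net.{URI, URISyntaxException}

object ColonInFileName {
  // Hypothetical, simplified split: if a ':' appears before any '/',
  // the text before it is taken as the scheme and the rest as the path.
  def splitLikePath(name: String): (Option[String], String) = {
    val colon = name.indexOf(':')
    val slash = name.indexOf('/')
    if (colon != -1 && (slash == -1 || colon < slash))
      (Some(name.substring(0, colon)), name.substring(colon + 1))
    else
      (None, name)
  }

  // Try to build a URI from the split parts; return the error message on failure.
  def tryBuildUri(name: String): Either[String, URI] = {
    val (scheme, path) = splitLikePath(name)
    try Right(new URI(scheme.orNull, null, path, null, null))
    catch { case e: URISyntaxException => Left(e.getMessage) }
  }

  def main(args: Array[String]): Unit = {
    // File name with colons: the part before the first ':' is taken as a scheme
    println(tryBuildUri("1037330823653531755-2017-10-16T03:06:00.avro"))
    // Same name with the colons replaced: no scheme is detected
    println(tryBuildUri("1037330823653531755-2017-10-16T03-06-00.avro"))
  }
}
```

The first call should come back as a `Left` carrying a "Relative path in absolute URI" message, because a scheme is present but the path has no leading "/". The second has no colon, so no scheme is detected and the URI is built as a plain relative path.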

Fix: do not use ":" in file names.
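
One way to apply this fix on the writing side is sketched below, assuming the file names are built from an id and a timestamp as in the failing example. The object and method names (`SafeFileNames`, `avroFileName`, `sanitize`) are hypothetical, and "-" as the replacement character is only one possible choice.

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

object SafeFileNames {
  // Timestamp pattern with '-' in place of ':' between hours, minutes, seconds
  private val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH-mm-ss")

  // Build a colon-free file name from an id and a timestamp
  def avroFileName(id: Long, ts: LocalDateTime): String =
    s"$id-${ts.format(fmt)}.avro"

  // Replace colons in an existing name, e.g. when renaming files already written
  def sanitize(name: String): String = name.replace(':', '-')

  def main(args: Array[String]): Unit = {
    println(avroFileName(1037330823653531755L, LocalDateTime.of(2017, 10, 16, 3, 6, 0)))
    println(sanitize("1037330823653531755-2017-10-16T03:06:00.avro"))
  }
}
```

Files that already exist in the bucket under names containing ":" would still need to be renamed or copied to new keys before the glob in the question can match them safely.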

Regarding "hadoop - Unable to read Avro from S3 using Spark on EMR", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/46772856/
