GoogleHadoopFileSystem cannot be cast to Hadoop FileSystem?

This article covers how to handle the "GoogleHadoopFileSystem cannot be cast to Hadoop FileSystem?" problem; hopefully the answer below is a useful reference for anyone hitting the same issue.

Problem Description

The original question was:

Short Answer

There are a few short-term options:


  1. Use Spark 1.3.1 instead.

  2. In your bdutil deployment, use HDFS as the default filesystem (--default_fs=hdfs); you can still specify gs:// paths directly in your jobs, it's just that HDFS will be used for intermediate data and staging files. There are some minor incompatibilities using raw Hive in this mode, though.

  3. Use a raw val sqlContext = new org.apache.spark.sql.SQLContext(sc) instead of HiveContext if you don't need HiveContext features (see the sketch after this list).

  4. git clone https://github.com/dennishuo/spark and run ./make-distribution.sh --name my-custom-spark --tgz --skip-java-test -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver to get a fresh tarball you can specify in your bdutil's spark_env.sh.
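
For option 3, here is a minimal spark-shell sketch of that workaround; the bucket name and path are made up for illustration:

// Build a plain SQLContext instead of HiveContext; this avoids the isolated
// Hive classloader that breaks GoogleHadoopFileSystem loading.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// gs:// paths can then be read directly (hypothetical bucket and path).
val df = sqlContext.parquetFile("gs://my-bucket/some/data.parquet")
df.registerTempTable("mydata")
sqlContext.sql("SELECT COUNT(*) FROM mydata").show()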

Long Answer

We've verified that it only manifests when fs.default.name and fs.defaultFS are set to a gs:// path regardless of whether trying to load a path from parquetFile("gs://...") or parquetFile("hdfs://..."), and when fs.default.name and fs.defaultFS are set to an HDFS path, loading data from both HDFS and from GCS works fine. This is also specific to Spark 1.4+ currently, and is not present in Spark 1.3.1 or older.
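
A minimal sketch of the difference that setting makes; the namenode host, bucket, and paths here are illustrative only:

// With fs.default.name / fs.defaultFS left as hdfs://<namenode>:8020 in the
// cluster configuration, both of these loads work in Spark 1.4:
val fromHdfs = sqlContext.parquetFile("hdfs:///tmp/some/data.parquet")
val fromGcs = sqlContext.parquetFile("gs://my-bucket/some/data.parquet")
// With fs.defaultFS pointed at a gs:// path instead, both calls fail with the
// ClassCastException shown at the end of this article.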

The regression appears to have been introduced in https://github.com/apache/spark/commit/9ac8393663d759860c67799e000ec072ced76493 which actually fixes a prior related classloading issue, SPARK-8368. While the fix itself is correct for normal cases, there's a method IsolatedClientLoader.isSharedClass used to determine which classloader to use, and interacts with the aforementioned commit to break GoogleHadoopFileSystem classloading.

The following lines in that file include everything under com.google.* as a "shared class" because of Guava and possibly protobuf dependencies which are indeed loaded as shared libraries, but unfortunately GoogleHadoopFileSystem should be loaded as a "hive class" in this case, just like org.apache.hadoop.hdfs.DistributedFileSystem. We just happen to unluckily share the com.google.* package namespace.
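
For reference, a paraphrased sketch of the kind of prefix check in IsolatedClientLoader; this is not the verbatim Spark source, see the commit linked above for the exact lines:

// Paraphrased from org.apache.spark.sql.hive.client.IsolatedClientLoader (Spark 1.4.x).
// Any class whose name matches these prefixes is loaded by the shared classloader;
// the "com.google" prefix sweeps in GoogleHadoopFileSystem along with Guava/protobuf.
def isSharedClass(name: String): Boolean =
  name.contains("slf4j") ||
  name.contains("log4j") ||
  name.startsWith("org.apache.spark.") ||
  name.startsWith("scala.") ||
  name.startsWith("com.google") ||
  name.startsWith("java.lang.") ||
  name.startsWith("java.net")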

This can be verified by adding the following line to ${SPARK_INSTALL}/conf/log4j.properties:

log4j.logger.org.apache.spark.sql.hive.client=DEBUG

And the output shows:

...
15/07/20 20:59:14 DEBUG IsolatedClientLoader: hive class: org.apache.hadoop.hdfs.DistributedFileSystem - jar:file:/home/hadoop/spark-install/lib/spark-assembly-1.4.1-hadoop2.6.0.jar!/org/apache/hadoop/hdfs/DistributedFileSystem.class
...
15/07/20 20:59:14 DEBUG IsolatedClientLoader: shared class: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
java.lang.RuntimeException: java.lang.ClassCastException: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem cannot be cast to org.apache.hadoop.fs.FileSystem

That's all for this article on "GoogleHadoopFileSystem cannot be cast to Hadoop FileSystem?"; hopefully the answer above is helpful.
