Problem Description
The original question: GoogleHadoopFileSystem cannot be cast to org.apache.hadoop.fs.FileSystem.
Short Answer
There are a few short-term options:
- Use Spark 1.3.1 for now.
- In your bdutil deployment, use HDFS as the default filesystem (default_fs=hdfs); you can still specify gs:// paths directly in your jobs, HDFS will just be used for intermediate data and staging files. There are some minor incompatibilities with using raw Hive in this mode, though.
- Use a raw val sqlContext = new org.apache.spark.sql.SQLContext(sc) instead of a HiveContext if you don't need HiveContext features (see the sketch after this list).
- git clone https://github.com/dennishuo/spark and run ./make-distribution.sh --name my-custom-spark --tgz --skip-java-test -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver to get a fresh tarball you can specify in your bdutil's spark_env.sh.
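A minimal sketch of the plain-SQLContext option, assuming a spark-shell session where sc already exists and using a hypothetical bucket name my-bucket:

// Sketch only: a plain SQLContext avoids the isolated Hive classloader entirely.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// Read Parquet directly from a gs:// path (Spark 1.4-era API).
val df = sqlContext.parquetFile("gs://my-bucket/some/data.parquet")
df.registerTempTable("my_table")
sqlContext.sql("SELECT COUNT(*) FROM my_table").show()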
Long Answer
We've verified that it only manifests when fs.default.name and fs.defaultFS are set to a gs:// path, regardless of whether the job tries to load a path via parquetFile("gs://...") or parquetFile("hdfs://..."); when fs.default.name and fs.defaultFS are set to an HDFS path, loading data from both HDFS and from GCS works fine. This is also specific to Spark 1.4+ currently, and is not present in Spark 1.3.1 or older.
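To illustrate (a hypothetical spark-shell repro; my-bucket is a placeholder name), the failure depends only on what the default filesystem points at, not on the scheme of the path being read:

// Hypothetical repro sketch. With fs.defaultFS / fs.default.name set to an hdfs:// URI
// in core-site.xml, both of these calls succeed on Spark 1.4.1:
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.parquetFile("hdfs:///tmp/data.parquet")
hiveContext.parquetFile("gs://my-bucket/data.parquet")
// With fs.defaultFS set to a gs:// path instead, either call fails with the
// ClassCastException shown at the end of this answer.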
The regression appears to have been introduced in https://github.com/apache/spark/commit/9ac8393663d759860c67799e000ec072ced76493, which actually fixes a prior related classloading issue, SPARK-8368. While the fix itself is correct for normal cases, there's a method IsolatedClientLoader.isSharedClass used to determine which classloader to use, and it interacts with the aforementioned commit to break GoogleHadoopFileSystem classloading. The following lines in that file include everything under com.google.* as a "shared class" because of Guava and possibly protobuf dependencies, which are indeed loaded as shared libraries, but unfortunately GoogleHadoopFileSystem should be loaded as a "hive class" in this case, just like org.apache.hadoop.hdfs.DistributedFileSystem. We just happen to unluckily share the com.google.* package namespace.
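For illustration, here is a simplified sketch (not the exact Spark source) of the kind of prefix check described above; because everything under com.google.* is treated as shared, the GCS connector class never reaches the "hive" classloader:

// Simplified, illustrative shared-class predicate (not the real Spark code).
def isSharedClass(name: String): Boolean =
  name.startsWith("scala.") ||
  name.startsWith("com.google") ||   // intended for Guava/protobuf, but also matches the GCS connector
  name.startsWith("java.lang.") ||
  name.contains("slf4j") ||
  name.contains("log4j")

isSharedClass("com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")  // true  -> shared classloader
isSharedClass("org.apache.hadoop.hdfs.DistributedFileSystem")           // false -> "hive" classloader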
This can be verified by adding the following line to ${SPARK_INSTALL}/conf/log4j.properties:
log4j.logger.org.apache.spark.sql.hive.client=DEBUG
And the output shows:
...
15/07/20 20:59:14 DEBUG IsolatedClientLoader: hive class: org.apache.hadoop.hdfs.DistributedFileSystem - jar:file:/home/hadoop/spark-install/lib/spark-assembly-1.4.1-hadoop2.6.0.jar!/org/apache/hadoop/hdfs/DistributedFileSystem.class
...
15/07/20 20:59:14 DEBUG IsolatedClientLoader: shared class: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
java.lang.RuntimeException: java.lang.ClassCastException: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem cannot be cast to org.apache.hadoop.fs.FileSystem