Problem description
The problem is quite simple: you have a local Spark instance (either a cluster, or just running in local mode) and you want to read from gs://.
I am submitting here the solution I have come up with by combining different resources:
1. Download the Google Cloud Storage connector (gcs-connector) and store it in the $SPARK/jars/ folder (check Alternative 1 at the bottom).
2. Download the core-site.xml file, or copy it from below. This is a configuration file used by Hadoop (which in turn is used by Spark).
3. Store the core-site.xml file in a folder. Personally, I create the $SPARK/conf/hadoop/conf/ folder and store it there.
4. In the spark-env.sh file, point to the Hadoop conf folder by adding the following line:
export HADOOP_CONF_DIR=</absolute/path/to/hadoop/conf/>
5. Create an OAuth2 key from the respective Google page (Google Console -> API Manager -> Credentials).
6. Copy the credentials into the core-site.xml file.
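Steps 2-4 boil down to a few shell commands. A minimal sketch, assuming $SPARK points at your Spark installation and core-site.xml is in the current directory:

# Create the conf folder and drop core-site.xml into it (step 3).
mkdir -p "$SPARK/conf/hadoop/conf"
cp core-site.xml "$SPARK/conf/hadoop/conf/"
# Point Spark's Hadoop at that folder (step 4); $SPARK expands to an absolute path here.
echo "export HADOOP_CONF_DIR=$SPARK/conf/hadoop/conf" >> "$SPARK/conf/spark-env.sh"

After editing spark-env.sh, restart any running Spark shells so the new environment variable is picked up.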
Alternative 1: Instead of copying the jar to the $SPARK/jars folder, you can store it in any folder and add that folder to the Spark classpath. One way is to edit SPARK_CLASSPATH in the spark-env.sh file, but SPARK_CLASSPATH is now deprecated. Therefore you need another way to add the jar to the Spark classpath; one option is sketched below.
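Since SPARK_CLASSPATH is deprecated, two common replacements are passing the jar at submit time or setting the extraClassPath properties. A minimal sketch, assuming the connector jar sits at /path/to/gcs-connector.jar and your application is my_app.py (both hypothetical placeholders):

# Option A: hand the jar to Spark when submitting an application.
spark-submit --jars /path/to/gcs-connector.jar my_app.py

# Option B: put the jar on the driver and executor classpaths permanently.
echo "spark.driver.extraClassPath /path/to/gcs-connector.jar" >> "$SPARK/conf/spark-defaults.conf"
echo "spark.executor.extraClassPath /path/to/gcs-connector.jar" >> "$SPARK/conf/spark-defaults.conf"

For reference, the core-site.xml mentioned in step 2 follows.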
<configuration>
<property>
<name>fs.gs.impl</name>
<value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
<description>Register GCS Hadoop filesystem</description>
</property>
<property>
<name>fs.gs.auth.service.account.enable</name>
<value>false</value>
<description>Force OAuth2 flow</description>
</property>
<property>
<name>fs.gs.auth.client.id</name>
<value>32555940559.apps.googleusercontent.com</value>
<description>Client id of Google-managed project associated with the Cloud SDK</description>
</property>
<property>
<name>fs.gs.auth.client.secret</name>
<value>fslkfjlsdfj098ejkjhsdf</value>
<description>Client secret of Google-managed project associated with the Cloud SDK</description>
</property>
<property>
<name>fs.gs.project.id</name>
<value>_THIS_VALUE_DOES_NOT_MATTER_</value>
<description>This value is required by GCS connector, but not used in the tools provided here.
The value provided is actually an invalid project id (starts with `_`).
</description>
</property>
</configuration>
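With the connector on the classpath and the credentials in core-site.xml, a quick smoke test is to pipe a one-line read into spark-shell. A minimal sketch; the bucket and object names are hypothetical placeholders:

# Replace gs://my-bucket/some/file.txt with an object your credentials can read.
echo 'spark.read.text("gs://my-bucket/some/file.txt").show(5)' | "$SPARK/bin/spark-shell"

If the setup is correct, the first five lines of the file are printed; a ClassNotFoundException for the GCS filesystem usually means the jar is not on the classpath, and an authorization error usually points at the credentials in core-site.xml.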