This article describes how to read from the google storage gs:// filesystem with a local spark instance. It should be a useful reference for anyone facing the same problem; interested readers can follow along below!

Problem Description


The problem is quite simple: you have a local spark instance (either a cluster or just running in local mode) and you want to read from gs:// (the Google Storage filesystem).

Solution

I am submitting here the solution I have come up with by combining different resources:

  1. Download the Google Cloud Storage connector (gs-connector) and store it in the $SPARK/jars/ folder (see Alternative 1 at the bottom).

  2. Download the core-site.xml file from here, or copy it from below. This is a configuration file used by Hadoop (which Spark uses).

  3. Store the core-site.xml file in a folder. Personally, I create the $SPARK/conf/hadoop/conf/ folder and store it there.

  4. In the spark-env.sh file, point to the hadoop conf folder by adding the following line: export HADOOP_CONF_DIR=</absolute/path/to/hadoop/conf/>

  5. Create an OAuth2 key from the corresponding Google page (Google Console -> API Manager -> Credentials).

  6. Copy the credentials to the core-site.xml file.
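
Once the six steps above are in place, gs:// paths should resolve from a plain local session. Here is a minimal PySpark sketch, assuming a hypothetical bucket and file name:

from pyspark.sql import SparkSession

# Local session; with HADOOP_CONF_DIR exported in spark-env.sh (step 4),
# Spark picks up the core-site.xml that registers the gs:// filesystem.
spark = (SparkSession.builder
         .master("local[*]")
         .appName("gcs-read-test")
         .getOrCreate())

# Hypothetical bucket and object name -- replace with your own gs:// path.
df = spark.read.text("gs://my-bucket/some/file.txt")
df.show(5)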

Alternative 1: Instead of copying the file to the $SPARK/jars folder, you can store the jar in any folder and add that folder to the spark classpath. One way is to edit SPARK_CLASSPATH in the spark-env.sh file, but SPARK_CLASSPATH is now deprecated. Therefore one can look here for how to add a jar to the spark classpath.
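
One option along those lines is to pass the jar when the session is created rather than touching classpath variables. A sketch, assuming the connector jar lives at the made-up path /opt/jars/gcs-connector.jar:

from pyspark.sql import SparkSession

# "spark.jars" ships the listed jars to the driver and executors,
# avoiding the deprecated SPARK_CLASSPATH variable.
spark = (SparkSession.builder
         .master("local[*]")
         .config("spark.jars", "/opt/jars/gcs-connector.jar")  # hypothetical path
         .getOrCreate())

The same jar can equally be supplied on the command line with spark-submit --jars /opt/jars/gcs-connector.jar.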

<configuration>
    <property>
        <name>fs.gs.impl</name>
        <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
        <description>Register GCS Hadoop filesystem</description>
    </property>
    <property>
        <name>fs.gs.auth.service.account.enable</name>
        <value>false</value>
        <description>Force OAuth2 flow</description>
    </property>
    <property>
        <name>fs.gs.auth.client.id</name>
        <value>32555940559.apps.googleusercontent.com</value>
        <description>Client id of Google-managed project associated with the Cloud SDK</description>
    </property>
    <property>
        <name>fs.gs.auth.client.secret</name>
        <value>fslkfjlsdfj098ejkjhsdf</value>
        <description>Client secret of Google-managed project associated with the Cloud SDK</description>
    </property>
    <property>
        <name>fs.gs.project.id</name>
        <value>_THIS_VALUE_DOES_NOT_MATTER_</value>
        <description>This value is required by GCS connector, but not used in the tools provided here.
            The value provided is actually an invalid project id (starts with `_`).</description>
    </property>
</configuration>
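
If editing core-site.xml is inconvenient, the same properties can be supplied at runtime: Spark copies any configuration key prefixed with spark.hadoop. into the Hadoop configuration it hands to the filesystem layer. A sketch mirroring the file above, with the two credential values as obvious placeholders:

from pyspark.sql import SparkSession

# Keys prefixed with "spark.hadoop." are forwarded to Hadoop's
# Configuration, mirroring the core-site.xml entries above.
spark = (SparkSession.builder
         .master("local[*]")
         .config("spark.hadoop.fs.gs.impl",
                 "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
         .config("spark.hadoop.fs.gs.auth.service.account.enable", "false")
         .config("spark.hadoop.fs.gs.auth.client.id", "YOUR_CLIENT_ID")          # placeholder
         .config("spark.hadoop.fs.gs.auth.client.secret", "YOUR_CLIENT_SECRET")  # placeholder
         .config("spark.hadoop.fs.gs.project.id", "_THIS_VALUE_DOES_NOT_MATTER_")
         .getOrCreate())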

This concludes this article on reading the google storage gs:// filesystem from a local spark instance. We hope the recommended answer is helpful, and thanks for your continued support!
