apache-spark - Processing data in Google Storage on an AWS EMR cluster with Spark

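How can I process data stored in Google Storage from Spark running on an AWS EMR cluster?
Suppose I have some data stored in gs://my-buckey/my-parquet-data. How can I read it from my EMR cluster without first copying it to S3 or downloading it to local storage?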

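Best answer

First, obtain Google HMAC credentials that have access to the GS bucket/objects you want to process.
Google Storage exposes an S3-compatible XML API, so you can then use the S3A file system (already bundled with the AWS Hadoop distribution) with the following Hadoop configuration values: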

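val conf = spark.sparkContext.hadoopConfiguration
// HMAC access ID and secret generated for the Google Storage account
conf.set("fs.s3a.access.key", "<hmac key>")
conf.set("fs.s3a.secret.key", "<hmac secret>")
// Address buckets as endpoint/bucket rather than bucket.endpoint
conf.setBoolean("fs.s3a.path.style.access", true)
// Send S3A requests to Google Storage instead of AWS S3
conf.set("fs.s3a.endpoint", "storage.googleapis.com")
// Use the version 1 list-objects API
conf.setInt("fs.s3a.list.version", 1)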
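
For a standalone application, the same settings can also be passed when the session is built, since Spark copies any spark.hadoop.* property into the Hadoop configuration. The following is only a minimal sketch: the object name, the app name, and the environment variable names GCS_HMAC_KEY and GCS_HMAC_SECRET are assumptions made for illustration and are not part of the original answer.

```scala
import org.apache.spark.sql.SparkSession

object GcsViaS3a {
  def main(args: Array[String]): Unit = {
    // Hypothetical variable names; sys.env(...) throws if a variable is not set.
    val hmacKey    = sys.env("GCS_HMAC_KEY")
    val hmacSecret = sys.env("GCS_HMAC_SECRET")

    // Same S3A options as in the answer, prefixed with "spark.hadoop."
    val spark = SparkSession.builder()
      .appName("gcs-via-s3a") // hypothetical app name
      .config("spark.hadoop.fs.s3a.access.key", hmacKey)
      .config("spark.hadoop.fs.s3a.secret.key", hmacSecret)
      .config("spark.hadoop.fs.s3a.path.style.access", "true")
      .config("spark.hadoop.fs.s3a.endpoint", "storage.googleapis.com")
      .config("spark.hadoop.fs.s3a.list.version", "1")
      .getOrCreate()

    spark.stop()
  }
}
```

Keeping the HMAC pair out of the source code is the main reason to prefer this form. Note that in spark-shell a session already exists, so setting the values on hadoopConfiguration as in the answer is the simpler route there.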

You can then access Google Storage through s3a paths, like this:

spark.read.parquet("s3a://<google storage bucket name>/<path>")
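
If existing jobs or configs already carry gs:// URIs such as the one in the question, a small helper can rewrite them into the s3a:// form expected by this setup. This is just a sketch: GcsPaths and gsToS3a are hypothetical names, and the helper only swaps the scheme, leaving the bucket and object path untouched.

```scala
object GcsPaths {
  // Rewrite a gs:// URI into the equivalent s3a:// URI (scheme swap only).
  def gsToS3a(uri: String): String = {
    require(uri.startsWith("gs://"), s"not a gs:// URI: $uri")
    "s3a://" + uri.stripPrefix("gs://")
  }
}

// Usage, assuming `spark` is configured as described in the answer:
// spark.read.parquet(GcsPaths.gsToS3a("gs://my-buckey/my-parquet-data"))
```

With the endpoint set globally as above, every s3a:// path in the session is sent to Google Storage, so the helper should only be applied to buckets that actually live there.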