Question
I have a large dataset stored in a BigQuery table and I would like to load it into a PySpark RDD for ETL processing.
I realized that BigQuery supports the Hadoop Input/Output format:
https://cloud.google.com/hadoop/writing-with-bigquery-connector
and PySpark should be able to use this interface to create an RDD via the newAPIHadoopRDD method:
http://spark.apache.org/docs/latest/api/python/pyspark.html
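For reference, here is a minimal sketch of the general call shape, shown with the stock Hadoop TextInputFormat rather than the BigQuery connector (the gs:// path is a placeholder, not from the original post):

# Illustrative only: newAPIHadoopRDD takes the InputFormat, key and value class
# names, plus a Hadoop configuration dict; here it reads plain text files.
rdd = sc.newAPIHadoopRDD(
    "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "org.apache.hadoop.io.Text",
    conf={"mapreduce.input.fileinputformat.inputdir": "gs://<bucket>/input"},
)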
Unfortunately, the documentation on both ends seems scarce, and the task goes beyond my knowledge of Hadoop/Spark/BigQuery. Has anybody figured out how to do this?
Answer
Google now has an example of how to use the BigQuery connector with Spark.
There does seem to be a problem using the GsonBigQueryInputFormat, but I got a simple Shakespeare word-count example working:
import json
import pyspark

sc = pyspark.SparkContext()
hadoopConf = sc._jsc.hadoopConfiguration()
hadoopConf.get("fs.gs.system.bucket")  # the cluster's GCS system bucket, usable as the temp bucket below

conf = {
    "mapred.bq.project.id": "<project_id>",      # project that runs the BigQuery export
    "mapred.bq.gcs.bucket": "<bucket>",          # GCS bucket for temporary export files
    "mapred.bq.input.project.id": "publicdata",  # source table: publicdata:samples.shakespeare
    "mapred.bq.input.dataset.id": "samples",
    "mapred.bq.input.table.id": "shakespeare",
}

# Rows arrive as (key, JSON string) pairs; parse each row, then sum the counts per word.
tableData = (
    sc.newAPIHadoopRDD("com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat",
                       "org.apache.hadoop.io.LongWritable",
                       "com.google.gson.JsonObject",
                       conf=conf)
    .map(lambda k: json.loads(k[1]))
    .map(lambda x: (x["word"], int(x["word_count"])))
    .reduceByKey(lambda x, y: x + y)
)
print(tableData.take(10))
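Note that the BigQuery connector JAR must be on Spark's classpath (e.g. passed via spark-submit --jars); on Dataproc clusters it is typically preinstalled. As a hypothetical follow-up, reusing tableData from the snippet above, the most frequent words could be pulled out like this:

# Hypothetical usage sketch: top 10 words by count, assuming tableData from above.
top10 = tableData.takeOrdered(10, key=lambda kv: -kv[1])
for word, count in top10:
    print(word, count)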