Problem Description
How do I connect Google Cloud Storage buckets to Apache Drill? I want to connect Apache Drill to Google Cloud Storage buckets and fetch data from the files stored in those buckets.
I can specify an access ID and key in core-site.xml in order to connect to AWS. Is there a similar way to connect Drill to Google Cloud?
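For reference, the AWS setup referred to here is the usual pair of s3a credential properties in core-site.xml (a sketch; the values are placeholders):

<!-- Placeholder AWS credentials in core-site.xml -->
<property>
  <name>fs.s3a.access.key</name>
  <value>YOUR_ACCESS_KEY</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>YOUR_SECRET_KEY</value>
</property>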
Recommended Answer
I found the answer here useful: Apache Drill using Google Cloud Storage
On Google Cloud Dataproc you can set it up with an initialization action as in the answer above. There's also a complete initialization action you can use, which creates a GCS plugin for you, pointed by default at the ephemeral bucket created with your Dataproc cluster.
If you're not using Cloud Dataproc you can do the following on your already-installed Drill cluster.
Get the GCS connector and put it in Drill's 3rdparty jars directory (one way to download it is sketched below). GCS configuration is detailed at the link above. On Dataproc the connector jar is in /usr/lib/hadoop, so the above initialization action does this:
# Link GCS connector to drill jars
ln -sf /usr/lib/hadoop/lib/gcs-connector-1.6.0-hadoop2.jar $DRILL_HOME/jars/3rdparty
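If you're not on Dataproc, one way to fetch the connector is Google's published download location (a sketch; make sure the connector build matches your Hadoop version):

# Download the GCS connector jar into Drill's 3rdparty directory
wget https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-hadoop2-latest.jar \
  -P $DRILL_HOME/jars/3rdparty/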
You also need to configure core-site.xml and make it available to Drill, so that Drill knows how to connect to GCS.
# Symlink core-site.xml to $DRILL_HOME/conf
ln -sf /etc/hadoop/conf/core-site.xml $DRILL_HOME/conf
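If you have no existing Hadoop configuration to symlink, a minimal sketch of a core-site.xml written directly into Drill's conf directory (the project ID and keyfile path are placeholders; service-account setup is covered at the link above):

# Sketch: minimal core-site.xml for the GCS connector
cat > $DRILL_HOME/conf/core-site.xml <<EOF
<configuration>
  <property>
    <name>fs.gs.impl</name>
    <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
  </property>
  <property>
    <name>fs.gs.project.id</name>
    <value>your-project-id</value>
  </property>
  <property>
    <name>google.cloud.auth.service.account.enable</name>
    <value>true</value>
  </property>
  <property>
    <name>google.cloud.auth.service.account.json.keyfile</name>
    <value>/path/to/keyfile.json</value>
  </property>
</configuration>
EOF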
Start or restart your drillbits as needed.
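On a plain install that usually means:

# Restart each drillbit so it picks up the new jar and core-site.xml
$DRILL_HOME/bin/drillbit.sh restart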
Once Drill is up, you can create a new plugin that points to a GCS bucket. First write out a JSON file containing the plugin configuration:
export DATAPROC_BUCKET=gs://your-bucket-name
cat > /tmp/gcs_plugin.json <<EOF
{
  "config": {
    "connection": "$DATAPROC_BUCKET",
    "enabled": true,
    "formats": {
      "avro": {
        "type": "avro"
      },
      "csv": {
        "delimiter": ",",
        "extensions": [
          "csv"
        ],
        "type": "text"
      },
      "csvh": {
        "delimiter": ",",
        "extensions": [
          "csvh"
        ],
        "extractHeader": true,
        "type": "text"
      },
      "json": {
        "extensions": [
          "json"
        ],
        "type": "json"
      },
      "parquet": {
        "type": "parquet"
      },
      "psv": {
        "delimiter": "|",
        "extensions": [
          "tbl"
        ],
        "type": "text"
      },
      "sequencefile": {
        "extensions": [
          "seq"
        ],
        "type": "sequencefile"
      },
      "tsv": {
        "delimiter": "\t",
        "extensions": [
          "tsv"
        ],
        "type": "text"
      }
    },
    "type": "file",
    "workspaces": {
      "root": {
        "defaultInputFormat": null,
        "location": "/",
        "writable": false
      },
      "tmp": {
        "defaultInputFormat": null,
        "location": "/tmp",
        "writable": true
      }
    }
  },
  "name": "gs"
}
EOF
Then POST the new plugin to any drillbit (I'm assuming you're running this on one of the drillbits):
curl -d@/tmp/gcs_plugin.json \
-H "Content-Type: application/json" \
-X POST http://localhost:8047/storage/gs.json
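You can confirm the plugin was registered by fetching it back from the same REST endpoint (or via the Storage tab of the Drill web UI at http://localhost:8047):

# Fetch the stored plugin configuration to confirm registration
curl http://localhost:8047/storage/gs.json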
I believe you need to repeat this procedure, changing the name ("gs" above), if you want Drill to query multiple buckets, as sketched below.
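For example, a minimal sketch for a second, hypothetical bucket, registered under the name "gs2":

# Hypothetical second bucket; regenerate /tmp/gcs_plugin.json with the heredoc
# above, changing "name": "gs" to "name": "gs2", then POST under the new name
export DATAPROC_BUCKET=gs://your-other-bucket-name
curl -d@/tmp/gcs_plugin.json \
  -H "Content-Type: application/json" \
  -X POST http://localhost:8047/storage/gs2.json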
Then you can launch sqlline and check that you can query files in that bucket.
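For example, assuming a CSV file at a hypothetical path data/example.csv inside the bucket:

# Connect with sqlline (adjust the ZooKeeper connection string for your cluster)
$DRILL_HOME/bin/sqlline -u jdbc:drill:zk=localhost:2181
-- then, at the sqlline prompt, query through the "gs" plugin's root workspace:
SELECT * FROM gs.root.`data/example.csv` LIMIT 10;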