This article covers Google Cloud Dataflow: writing messages from PubSub to Parquet. It should be a useful reference for anyone solving the same problem; follow along below.

Problem description

I'm trying to write Google PubSub messages to Google Cloud Storage using Google Cloud Dataflow. The PubSub messages come in JSON format, and the only operation I want to perform is a transformation from JSON to Parquet files.

In the official documentation I found a template provided by Google that reads data from a Pub/Sub topic and writes Avro files into the specified Cloud Storage bucket (https://cloud.google.com/dataflow/docs/guides/templates/provided-streaming#pubsub-to-cloud-storage-avro). The problem is that the template source code is written in Java, while I would prefer to use the Python SDK.

These are the first tests I'm doing with Dataflow and Beam in general, and there's not a lot of material online to take a hint from. Any suggestions, links, guidance, or code snippets would be greatly appreciated.

Recommended answer

In order to further contribute to the community, I am summarising our discussion as an answer.

Since you are starting with Dataflow, I can point out some useful topics and advice:

  1. The built-in PTransform WriteToParquet() in Apache Beam is very useful. It writes Parquet files from a PCollection of records. In order to use it and write a Parquet file, you need to specify a schema, as indicated in the documentation. In addition, this article will help you better understand how to use this method and how to write the output to a Google Cloud Storage (GCS) bucket.
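
To make this concrete, here is a minimal sketch of WriteToParquet() in the Python SDK, assuming a hypothetical schema, record fields, and output bucket (it requires pyarrow, since the schema is declared as a pyarrow schema):

```python
import apache_beam as beam
import pyarrow
from apache_beam.io.parquetio import WriteToParquet

# Hypothetical schema for the records; WriteToParquet expects a pyarrow schema.
SCHEMA = pyarrow.schema([
    ('user_id', pyarrow.string()),
    ('value', pyarrow.int64()),
])

with beam.Pipeline() as pipeline:
    (
        pipeline
        # A small in-memory PCollection of dict records, one dict per Parquet row.
        | 'CreateRecords' >> beam.Create([
            {'user_id': 'u1', 'value': 42},
            {'user_id': 'u2', 'value': 7},
        ])
        | 'WriteToParquet' >> WriteToParquet(
            file_path_prefix='gs://my-bucket/output/records',  # hypothetical bucket
            schema=SCHEMA,
            file_name_suffix='.parquet',
        )
    )
```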

Google provides this code explaining how to read messages from PubSub and write them into Google Cloud Storage. This quickstart reads messages from PubSub and writes the messages from each window to a bucket.
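
As a hedged sketch of the reading side of that pattern (the topic name is hypothetical), the pipeline below reads from Pub/Sub in streaming mode, decodes each message, and groups the stream into fixed windows; the messages are only printed here, and the per-window write to a bucket is shown in the full sketch further below:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.window import FixedWindows

# Reading from Pub/Sub requires a streaming pipeline.
options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | 'ReadFromPubSub' >> beam.io.ReadFromPubSub(
            topic='projects/my-project/topics/my-topic')  # hypothetical topic
        | 'Decode' >> beam.Map(lambda msg: msg.decode('utf-8'))
        | 'Window' >> beam.WindowInto(FixedWindows(60))  # 1-minute fixed windows
        # The quickstart groups each window's messages and writes them to a
        # bucket; here they are just printed as a placeholder.
        | 'Print' >> beam.Map(print)
    )
```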

Since you want to read from PubSub, write the messages to Parquet, and store the files in a GCS bucket, I would advise structuring your pipeline around the following steps: read your messages, write them to a Parquet file, and store it in GCS.
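
Putting those steps together, below is a hedged sketch of one possible pipeline, not the Google-provided template: it reads JSON messages from a hypothetical Pub/Sub topic, batches them per fixed window, and writes each window's batch as one Parquet file to a hypothetical GCS path using pyarrow and Beam's FileSystems API. The schema, field names, topic, bucket, and the WriteWindowToParquet helper are all illustrative assumptions.

```python
import json

import apache_beam as beam
import pyarrow
import pyarrow.parquet as pq
from apache_beam.io.filesystems import FileSystems
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.window import FixedWindows

# Hypothetical schema of the incoming JSON messages.
SCHEMA = pyarrow.schema([
    ('user_id', pyarrow.string()),
    ('value', pyarrow.int64()),
])


class WriteWindowToParquet(beam.DoFn):
    """Illustrative helper: writes one window's batch of records to one Parquet file."""

    def __init__(self, output_prefix):
        self.output_prefix = output_prefix

    def process(self, keyed_batch, window=beam.DoFn.WindowParam):
        _, records = keyed_batch
        # Requires a reasonably recent pyarrow for Table.from_pylist.
        table = pyarrow.Table.from_pylist(list(records), schema=SCHEMA)
        window_start = window.start.to_utc_datetime().strftime('%Y%m%d-%H%M%S')
        path = f'{self.output_prefix}-{window_start}.parquet'
        # FileSystems.create returns a writable file-like object for gs:// paths.
        with FileSystems.create(path) as f:
            pq.write_table(table, f)


def run():
    options = PipelineOptions(streaming=True)
    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | 'ReadFromPubSub' >> beam.io.ReadFromPubSub(
                topic='projects/my-project/topics/my-topic')  # hypothetical topic
            | 'ParseJson' >> beam.Map(lambda msg: json.loads(msg.decode('utf-8')))
            | 'Window' >> beam.WindowInto(FixedWindows(60))  # 1-minute windows
            | 'AddKey' >> beam.WithKeys(lambda _: 0)  # single shard per window
            | 'GroupPerWindow' >> beam.GroupByKey()
            | 'WriteParquet' >> beam.ParDo(
                WriteWindowToParquet('gs://my-bucket/output/records'))  # hypothetical
        )


if __name__ == '__main__':
    run()
```

The explicit per-window DoFn keeps file naming under your control; Beam's fileio.WriteToFiles with a custom sink is another option for streaming file writes.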

I encourage you to read the above links. Then, if you have any other questions, you can post another thread in order to get more specific help.

This concludes the article on Google Cloud Dataflow: from PubSub to Parquet. We hope the recommended answer is helpful, and thank you for your support!
