This article looks at how to deal with a "Dataflow: no worker activity" error; the question and recommended answer below may serve as a useful reference.

Problem Description

I'm having a few problems running a relatively vanilla Dataflow job from an AI Platform Notebook (the job is meant to take data from BigQuery > cleanse and prep > write to a CSV in GCS):

import apache_beam as beam

# Pipeline options for the Dataflow runner (paths and IDs below are placeholders)
options = {'staging_location': '/staging/location/',
           'temp_location': '/temp/location/',
           'job_name': 'dataflow_pipeline_job',
           'project': PROJECT,
           'teardown_policy': 'TEARDOWN_ALWAYS',
           'max_num_workers': 3,
           'region': REGION,
           'subnetwork': 'regions/<REGION>/subnetworks/<SUBNETWORK>',
           'no_save_main_session': True}
opts = beam.pipeline.PipelineOptions(flags=[], **options)

# Read from BigQuery, convert each row to a CSV line, write the result to GCS
p = beam.Pipeline('DataflowRunner', options=opts)
(p
 | 'read' >> beam.io.Read(beam.io.BigQuerySource(query=selquery, use_standard_sql=True))
 | 'csv' >> beam.FlatMap(to_csv)
 | 'out' >> beam.io.Write(beam.io.WriteToText('OUTPUT_DIR/out.csv')))
p.run()

The error returned from Stackdriver was a warning that the job showed no worker activity.

Unfortunately not much else other than that. Other things to note:

  • The job ran locally without any error
  • The network is running in custom mode but is the default network
  • Python version == 3.5.6
  • Python Apache Beam version == 2.16.0
  • The AI Platform Notebook is in fact a GCE instance with a Deep Learning VM image deployed on top (with a container-optimised OS); we have then used port forwarding to access the Jupyter environment
  • The service account requesting the job (the Compute Engine default service account) has the necessary permissions to complete this
  • The notebook instance, Dataflow job, and GCS bucket are all in europe-west1
  • I've also tried running this on a standard AI Platform Notebook and still hit the same problem.

Any help would be much appreciated! Please let me know if there is any other info I can provide which will help.

I've realised that my error is the same as the following:

Why do the Dataflow steps not start?

The reason my job has gotten stuck is that the write-to-GCS step runs first, even though it is meant to run last. Any ideas on how to fix this?

Recommended Answer

Upon code inspection, I noticed that the syntax of the WriteToText transform used does not match the one suggested in the Apache Beam docs.

Please follow the syntax suggested here.
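
For illustration, here is a minimal sketch of that suggested usage with the Beam Python SDK; the query, the to_csv helper, and the gs://my-bucket/output path are placeholders and not part of the original question:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Illustrative query and row-to-CSV helper; adjust to your own schema.
selquery = 'SELECT name, value FROM `my-project.my_dataset.my_table`'

def to_csv(row):
    yield '{},{}'.format(row.get('name'), row.get('value'))

# WriteToText is itself a PTransform, so it is applied directly (without a
# beam.io.Write wrapper) and takes a file path *prefix* plus an optional
# suffix, rather than a single literal file name.
with beam.Pipeline(options=PipelineOptions()) as p:
    (p
     | 'read' >> beam.io.Read(beam.io.BigQuerySource(query=selquery,
                                                     use_standard_sql=True))
     | 'csv' >> beam.FlatMap(to_csv)
     | 'out' >> beam.io.WriteToText('gs://my-bucket/output/out',
                                    file_name_suffix='.csv'))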

The suggested workaround is to consider using the BigQuery-to-CSV file export option available in batch mode.

There are even more export options available. The full list can be found in the "data formats and compression types" documentation here.
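
As a rough sketch of what that export can look like from Python, assuming the google-cloud-bigquery client library (the project, dataset, table, and bucket names below are placeholders):

from google.cloud import bigquery

# Run a BigQuery extract (export) job that writes the table as compressed CSV to GCS.
# All project/dataset/table/bucket names here are placeholders.
client = bigquery.Client(project='my-project')
table_ref = client.dataset('my_dataset').table('my_table')

job_config = bigquery.job.ExtractJobConfig(
    destination_format=bigquery.DestinationFormat.CSV,
    compression=bigquery.Compression.GZIP)

extract_job = client.extract_table(
    table_ref,
    'gs://my-bucket/exports/my_table-*.csv.gz',
    location='europe-west1',  # keep the job in the same region as the data
    job_config=job_config)
extract_job.result()  # block until the export completes

The wildcard in the destination URI lets BigQuery shard large exports across multiple files.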
