Problem Description
I have a Python Cloud Dataflow job that works fine on smaller subsets, but seems to be failing for no obvious reason on the complete dataset.
The only error I get in the Dataflow interface is the standard error message:
A work item was attempted 4 times without success. Each time the worker eventually lost contact with the service.
Analysing the Stackdriver logs only shows this error:
Exception in worker loop: Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 736, in run
    deferred_exception_details=deferred_exception_details)
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 590, in do_work
    exception_details=exception_details)
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/utils/retry.py", line 167, in wrapper
    return fun(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 454, in report_completion_status
    exception_details=exception_details)
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 266, in report_status
    work_executor=self._work_executor)
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/workerapiclient.py", line 364, in report_status
    response = self._client.projects_jobs_workItems.ReportStatus(request)
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/internal/clients/dataflow/dataflow_v1b3_client.py", line 210, in ReportStatus
    config, request, global_params=global_params)
  File "/usr/local/lib/python2.7/dist-packages/apitools/base/py/base_api.py", line 723, in _RunMethod
    return self.ProcessHttpResponse(method_config, http_response, request)
  File "/usr/local/lib/python2.7/dist-packages/apitools/base/py/base_api.py", line 729, in ProcessHttpResponse
    self.__ProcessHttpResponse(method_config, http_response, request))
  File "/usr/local/lib/python2.7/dist-packages/apitools/base/py/base_api.py", line 599, in __ProcessHttpResponse
    http_response.request_url, method_config, request)
HttpError: HttpError accessing <https://dataflow.googleapis.com/v1b3/projects//jobs/2017-05-03_03_33_40-3860129055041750274/workItems:reportStatus?alt=json>: response: <{'status': '400', 'content-length': '360', 'x-xss-protection': '1; mode=block', 'x-content-type-options': 'nosniff', 'transfer-encoding': 'chunked', 'vary': 'Origin, X-Origin, Referer', 'server': 'ESF', '-content-encoding': 'gzip', 'cache-control': 'private', 'date': 'Wed, 03 May 2017 16:46:11 GMT', 'x-frame-options': 'SAMEORIGIN', 'content-type': 'application/json; charset=UTF-8'}>, content <{ "error": { "code": 400, "message": "(2a7b20b33659c46e): Failed to publish the result of the work update. Causes: (2a7b20b33659c523): Failed to update work status. Causes: (8a8b13f5c3a944ba): Failed to update work status., (8a8b13f5c3a945d9): Work \"4047499437681669251\" not leased (or the lease was lost).", "status": "INVALID_ARGUMENT" } } >
I assume this Failed to update work status error is related to the Cloud Runner? But since I didn't find any information on this error online, I was wondering if somebody else has encountered it and has a better explanation?
I am using Google Cloud Dataflow SDK for Python 0.5.5.
Recommended Answer
One major cause of lease expirations is memory pressure on the VM. You can try running your job on machines with more memory; in particular, a highmem machine type should do the trick.
For more info on machine types, please check out the GCE documentation.
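In the Python SDK, the worker machine type is set through the pipeline options handed to the Dataflow runner. The following is a minimal sketch only, not the poster's pipeline: the project, bucket, and file paths are placeholders, and details such as the runner name (DataflowPipelineRunner in the 0.5.x SDK, DataflowRunner from 2.0.0 onwards) and the exact flag spelling can differ between SDK versions, so check the WorkerOptions of the SDK you are on.

# Minimal sketch: requesting high-memory Dataflow workers via pipeline options.
# YOUR_PROJECT / YOUR_BUCKET and the file paths are placeholders, not values from the question.
import apache_beam as beam

options = [
    '--runner=DataflowRunner',               # 'DataflowPipelineRunner' on the 0.5.x SDK
    '--project=YOUR_PROJECT',                # placeholder
    '--temp_location=gs://YOUR_BUCKET/tmp',  # placeholder
    '--worker_machine_type=n1-highmem-8',    # highmem machines give more RAM per worker
]

p = beam.Pipeline(argv=options)
(p
 | 'Read' >> beam.io.ReadFromText('gs://YOUR_BUCKET/input*')    # placeholder input
 | 'Process' >> beam.Map(lambda line: line.strip())
 | 'Write' >> beam.io.WriteToText('gs://YOUR_BUCKET/output'))   # placeholder output
p.run()

If the job is launched from the command line instead, adding --worker_machine_type=n1-highmem-8 (or a larger highmem size) to the existing invocation has the same effect.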
The next Dataflow release (2.0.0) should be able to handle these cases better.