This article covers how to handle a Glue job that fails to write its output files. It may be a useful reference for anyone troubleshooting the same problem.

Problem description

I am backfilling some data via Glue jobs. The job itself reads a TSV from S3, transforms the data slightly, and writes it to S3 as Parquet. Since I already have the data, I am trying to launch multiple jobs at once to reduce the time needed to process it all. When I launch multiple jobs at the same time, I sometimes run into an issue where one of the jobs fails to output the resulting Parquet files to S3, even though the job itself completes successfully without throwing an error. When I rerun the job as a non-parallel task, it outputs the files correctly. Is there some issue, either with Glue (or the underlying Spark) or with S3, that would cause this?
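For reference, a job like the one described might look roughly like the following PySpark sketch. The bucket paths, column mappings, and transformation-context names are placeholders for illustration, not the asker's actual code:

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the TSV input from S3 (path is a placeholder).
source = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://example-bucket/input/"]},
    format="csv",
    format_options={"separator": "\t", "withHeader": True},
    transformation_ctx="read_tsv",
)

# A slight transformation, e.g. retyping a column (columns are illustrative).
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[
        ("id", "string", "id", "long"),
        ("value", "string", "value", "string"),
    ],
    transformation_ctx="apply_mapping",
)

# Write the result back to S3 as Parquet.
glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/output/"},
    format="parquet",
    transformation_ctx="write_parquet",
)

job.commit()
```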

Recommended answer

The same Glue job running in parallel may produce files with the same names, so some of them can be overwritten. If I remember correctly, the transformation context is used as part of the file name. I assume you don't have job bookmarks enabled, so it should be safe for you to generate the transformation-context value dynamically to ensure it is unique for each job run.
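A minimal sketch of that suggestion follows. It assumes job bookmarks are disabled (bookmarks depend on stable transformation-context names, so randomizing them would break bookmark tracking); the uuid-based suffix and the bucket paths are illustrative choices, not part of the original answer:

```python
import uuid

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# A per-run suffix keeps each transformation context unique across
# concurrent runs. uuid is one option; a job-run id passed in as a job
# argument would also work. (Only safe with job bookmarks disabled.)
run_suffix = uuid.uuid4().hex

frame = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://example-bucket/input/"]},  # placeholder
    format="csv",
    format_options={"separator": "\t", "withHeader": True},
    transformation_ctx=f"read_tsv_{run_suffix}",
)

glue_context.write_dynamic_frame.from_options(
    frame=frame,
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/output/"},  # placeholder
    format="parquet",
    transformation_ctx=f"write_parquet_{run_suffix}",
)
```

Alternatively, writing each parallel run to its own output prefix (or partition) would avoid name collisions regardless of how Glue names the part files.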

That concludes this article on a Glue job failing to write files. Hopefully the recommended answer above is helpful.
