This article describes how to avoid the creation of _$folder$ keys in S3 when using Hadoop (EMR).

Problem description

I am using an EMR Activity in AWS Data Pipeline. This EMR Activity runs a Hive script on an EMR cluster. It takes a DynamoDB table as input and stores the output data in S3.

This is the EMR step used in the EMR Activity:

s3://elasticmapreduce/libs/script-runner/script-runner.jar,s3://elasticmapreduce/libs/hive/hive-script,--run-hive-script,--hive-versions,latest,--args,-f,s3://my-s3-bucket/hive/my_hive_script.q,-d,DYNAMODB_INPUT_TABLE1=MyTable,-d,S3_OUTPUT_BUCKET=#{output.directoryPath}

where output.directoryPath is:

s3://my-s3-bucket/output/#{format(@scheduledStartTime,"YYYY-MM-dd")}

So this creates one folder and one file in S3 (technically speaking, it creates two keys: 2017-03-18/<some_random_number> and 2017-03-18_$folder$):

2017-03-18
2017-03-18_$folder$

How can I avoid the creation of these extra empty _$folder$ keys?

I found a solution listed at https://issues.apache.org/jira/browse/HADOOP-10400, but I don't know how to implement it in AWS Data Pipeline.

Recommended answer

EMR doesn't seem to provide a way to avoid this.

You can safely delete any empty files with the <directoryname>_$folder$ suffix that appear in your S3 buckets. These empty files are created by the Hadoop framework at runtime; Hadoop is designed to process the data correctly even if they are removed.

https://aws.amazon.com/premiumsupport/knowledge-center/emr-s3-empty-files/
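As a one-off cleanup, the existing marker keys could be removed with a short boto3 script. This is a hedged sketch, not an official tool: the bucket name and prefix are placeholders taken from the question, and `find_folder_markers` is a hypothetical helper that only inspects the listing S3 returns.

```python
# Sketch: delete existing empty *_$folder$ marker keys from an S3 bucket.
# Assumes boto3 is installed and AWS credentials are configured; the bucket
# name and prefix passed to delete_folder_markers are placeholders.

def find_folder_markers(objects):
    """Return keys of empty objects whose name ends in '_$folder$'.

    `objects` is a list of dicts shaped like the 'Contents' entries
    returned by S3's ListObjectsV2 (each with 'Key' and 'Size').
    """
    return [
        o["Key"]
        for o in objects
        if o["Key"].endswith("_$folder$") and o.get("Size", 0) == 0
    ]


def delete_folder_markers(bucket, prefix=""):
    """List the bucket and delete every empty *_$folder$ key under prefix."""
    import boto3  # imported here so the pure helper above needs no SDK

    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        markers = find_folder_markers(page.get("Contents", []))
        if markers:
            s3.delete_objects(
                Bucket=bucket,
                Delete={"Objects": [{"Key": k} for k in markers]},
            )
```

Usage would be something like `delete_folder_markers("my-s3-bucket", prefix="output/")`, run once after the pipeline completes.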

It's in the Hadoop source code, so it could be fixed, but apparently it's not fixed in EMR.

If you are feeling clever, you could create an S3 event notification that matches the _$folder$ suffix, and have it fire off a Lambda function to delete the objects after they're created.
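A minimal sketch of such a Lambda handler, assuming the function is subscribed to `s3:ObjectCreated:*` notifications with a `_$folder$` suffix filter (the helper name `marker_keys` is illustrative, not part of any AWS API):

```python
# Sketch of a Lambda handler that deletes *_$folder$ keys as they appear.
# Assumes the function is wired to s3:ObjectCreated:* notifications with a
# "_$folder$" suffix filter; the endswith() check below is a safety net.
from urllib.parse import unquote_plus


def marker_keys(event):
    """Extract (bucket, key) pairs for *_$folder$ objects from an S3 event."""
    pairs = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys in S3 event notifications are URL-encoded.
        key = unquote_plus(record["s3"]["object"]["key"])
        if key.endswith("_$folder$"):
            pairs.append((bucket, key))
    return pairs


def handler(event, context):
    import boto3  # imported lazily so marker_keys stays SDK-free

    s3 = boto3.client("s3")
    for bucket, key in marker_keys(event):
        s3.delete_object(Bucket=bucket, Key=key)
```

The Lambda's execution role would need `s3:DeleteObject` on the output prefix for this to work.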
