This article looks at how to avoid having Hadoop (EMR) create _$folder$ keys in S3.

Problem Description

I am using an EMR Activity in AWS Data Pipeline. This EMR Activity runs a Hive script on an EMR cluster. It takes DynamoDB as input and stores the output in S3.

This is the EMR step used in the EMR Activity:

s3://elasticmapreduce/libs/script-runner/script-runner.jar,s3://elasticmapreduce/libs/hive/hive-script,--run-hive-script,--hive-versions,latest,--args,-f,s3://my-s3-bucket/hive/my_hive_script.q,-d,DYNAMODB_INPUT_TABLE1=MyTable,-d,S3_OUTPUT_BUCKET=#{output.directoryPath}

where output.directoryPath is:

s3://my-s3-bucket/output/#{format(@scheduledStartTime,"YYYY-MM-dd")}

So this creates one folder and one file in S3. (Technically speaking, it creates two keys: 2017-03-18/<some_random_number> and 2017-03-18_$folder$.)

2017-03-18
2017-03-18_$folder$

How can I avoid the creation of these extra empty _$folder$ files?

I found a solution listed at https://issues.apache.org/jira/browse/HADOOP-10400, but I don't know how to implement it in AWS Data Pipeline.

Answer

EMR doesn't seem to provide a way to avoid this.

Because S3 uses a key-value storage system, the Hadoop file system implements directory support in S3 by creating empty files with the "_$folder$" suffix.

You can safely delete any empty files with the <directoryname>_$folder$ suffix that appear in your S3 buckets. These empty files are created by the Hadoop framework at runtime, but Hadoop is designed to process data even if these empty files are removed.

https://aws.amazon.com/premiumsupport/knowledge-center/emr-s3-empty-files/
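
If existing marker keys are already cluttering the bucket, a one-off cleanup script can remove them. The following is a minimal boto3 sketch, not part of the original answer; the bucket name and prefix are taken from the question and are assumptions you would adjust to your own layout.

import boto3

s3 = boto3.client("s3")
BUCKET = "my-s3-bucket"  # assumption: the bucket from the question
PREFIX = "output/"       # assumption: the output prefix from the question

# Collect every key under the prefix that ends with the "_$folder$" marker suffix.
to_delete = []
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        if obj["Key"].endswith("_$folder$"):
            to_delete.append({"Key": obj["Key"]})

# delete_objects accepts at most 1000 keys per request.
for i in range(0, len(to_delete), 1000):
    s3.delete_objects(Bucket=BUCKET, Delete={"Objects": to_delete[i:i + 1000]})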

It's in the Hadoop source code, so it could be fixed, but apparently it's not fixed in EMR.

If you are feeling clever, you could create an S3 event notification that matches the _$folder$ suffix, and have it fire off a Lambda function to delete the objects after they're created.
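
A minimal sketch of such a Lambda handler follows, assuming the bucket's event notification is configured for ObjectCreated events with a suffix filter of _$folder$; the handler name and the defensive suffix check are illustrative, not part of the original answer. The function's execution role also needs s3:DeleteObject permission on the bucket.

import urllib.parse

import boto3

s3 = boto3.client("s3")

def handler(event, context):
    # One invocation can carry several records; object keys arrive URL-encoded.
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        if key.endswith("_$folder$"):  # defensive check on top of the suffix filter
            s3.delete_object(Bucket=bucket, Key=key)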
