AWS Athena: does "msck repair table" incur costs?

Question

I have ORC data in S3 that looks like this:

s3://bucket/orc/clientId=client-1/year=2017/month=3/day=16/hour=20/
s3://bucket/orc/clientId=client-2/year=2017/month=3/day=16/hour=21/
s3://bucket/orc/clientId=client-3/year=2017/month=3/day=16/hour=22/

Every hour I run an EMR job that converts raw JSON in S3 to ORC and writes it out using the path partition convention (above) for Athena ingestion. After the EMR job completes, I run msck repair table so Athena can pick up the new partitions.

I have 3 related questions:

  1. Does running msck repair table in this scenario cost me money in AWS?
  2. The AWS docs say msck repair table can time out. Is there a way to make a step in the data pipeline that keeps rerunning this command until it completes successfully?
  3. I would prefer to add the partitions manually to Athena (since I know the year, month, day and hour I'm working on). However, I do not know the clientId values, because there could be 1-X of them and I don't know which ones exist at the time the EMR job runs. Is there a best-practice way to solve this problem (using Hive or something else)? I could make an S3 API call to get a listing of s3://bucket/orc/ and write code to iterate over the list and add the partitions manually. I'm hoping there is an easier way...

Note: when I say "add partitions manually" I mean doing something like this:

ALTER TABLE <athena table>
ADD PARTITION (clientId='client-1',year=2017,month=3,day=16,hour=20)
location 's3://bucket/orc/clientId=client-1/year=2017/month=3/day=16/hour=20/';
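
The "list and iterate" idea from question 3 could be sketched roughly as follows. This is a minimal sketch, not a tested pipeline step: it assumes the prefixes come from an S3 listing of the orc/ prefix (for example the CommonPrefixes returned by boto3's list_objects_v2 with Delimiter="/"), and the table name my_table and both function names are hypothetical.

```python
import re

# Matches the "clientId=<value>/" component of a prefix such as
# "orc/clientId=client-1/" (the shape S3 returns as a common prefix).
CLIENT_RE = re.compile(r"clientId=([^/]+)/")


def client_ids(prefixes):
    """Return the distinct clientId values found in the prefixes, sorted."""
    found = set()
    for prefix in prefixes:
        match = CLIENT_RE.search(prefix)
        if match:
            found.add(match.group(1))
    return sorted(found)


def add_partitions_sql(table, bucket, base, clients, year, month, day, hour):
    """Build one ALTER TABLE statement covering every client for one hour."""
    clauses = []
    for client in clients:
        location = (
            f"s3://{bucket}/{base}/clientId={client}/"
            f"year={year}/month={month}/day={day}/hour={hour}/"
        )
        clauses.append(
            f"PARTITION (clientId='{client}',year={year},month={month},"
            f"day={day},hour={hour}) LOCATION '{location}'"
        )
    return f"ALTER TABLE {table} ADD IF NOT EXISTS\n" + "\n".join(clauses) + ";"


if __name__ == "__main__":
    # Hypothetical listing result; "orc/_tmp/" is ignored by the regex.
    listed = ["orc/clientId=client-2/", "orc/clientId=client-1/", "orc/_tmp/"]
    print(add_partitions_sql("my_table", "bucket", "orc",
                             client_ids(listed), 2017, 3, 16, 20))
```

A client found under the base prefix may have no data for the hour being loaded, so listing the full hour-level prefix per client would be a stricter variant. IF NOT EXISTS is used here so that rerunning the statement for the same hour does not fail on partitions that were already added.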


Recommended answer

I do not yet know how to automate msck repair table to make sure it completes.
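
One possible shape for such a step is a plain retry loop with backoff. The sketch below is an illustration under assumptions, not a verified recipe: run_query is a hypothetical callable that submits the statement (msck repair table, or an ALTER TABLE ... ADD PARTITION) and returns True only once it has finished successfully; how it talks to Athena is deliberately left out.

```python
import time


def run_until_success(run_query, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call run_query() until it returns True or the attempts run out.

    run_query is a hypothetical callable: it should submit the statement
    and return True only when the statement completed successfully.
    Returns the number of attempts that were needed.
    """
    for attempt in range(1, max_attempts + 1):
        if run_query():
            return attempt
        if attempt < max_attempts:
            # Exponential backoff: base_delay, 2*base_delay, 4*base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError(f"query did not succeed after {max_attempts} attempts")


if __name__ == "__main__":
    # Simulated outcomes: two failures (e.g. timeouts), then success.
    outcomes = iter([False, False, True])
    attempts = run_until_success(lambda: next(outcomes), base_delay=0.1)
    print(f"succeeded after {attempts} attempts")
```

The attempt cap is a design choice: an unbounded loop would hide a statement that can never succeed, whereas raising after max_attempts lets the surrounding pipeline step fail visibly.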

06-21 18:00