Partitioned Athena queries by S3 creation date


Question

I have an S3 bucket with ~70 million JSON objects (~15 TB) and an Athena table to query by timestamp and some other keys defined in the JSON.

It is guaranteed that the timestamp in the JSON is more or less equal to the S3 createdDate of the JSON (or at least close enough for the purpose of my query).

Can I somehow improve query performance (and cost) by adding the createdDate as something like a "partition" - which, as I understand it, seems only to be possible for prefixes/folders?

Edit: I currently simulate this by using the S3 Inventory CSV to pre-filter by createdDate, then downloading all the JSONs and doing the rest of the filtering, but I'd like to do that completely inside Athena, if possible.

Answer

There is no way to make Athena use things like S3 object metadata for query planning. The only way to make Athena skip reading objects is to organize the objects in a way that makes it possible to set up a partitioned table, and then query with filters on the partition keys.

It sounds like you have an idea of how partitioning in Athena works, and I assume there is a reason that you are not using it. However, for the benefit of others with similar problems coming across this question, I'll start by explaining what you can do if you can change the way the objects are organized. I'll give an alternative suggestion at the end; you may want to jump straight to that.

I would suggest you organize the JSON objects using prefixes that contain some part of the timestamps of the objects. Exactly how much depends on the way you query the data. You don't want it too granular, nor too coarse. Making it too granular will make Athena spend more time listing files on S3; making it too coarse will make it read too many files. If the most common query period is a month, that is a good granularity; if the most common period is a couple of days, then day is probably better.

For example, if day is the best granularity for your dataset, you could organize the objects using keys like this:

s3://some-bucket/data/2019-03-07/object0.json
s3://some-bucket/data/2019-03-07/object1.json
s3://some-bucket/data/2019-03-08/object0.json
s3://some-bucket/data/2019-03-08/object1.json
s3://some-bucket/data/2019-03-08/object2.json

You can also use a Hive-style partitioning scheme, which is what other tools like Glue, Spark, and Hive expect, so unless you have reasons not to, it can save you grief in the future:

s3://some-bucket/data/created_date=2019-03-07/object0.json
s3://some-bucket/data/created_date=2019-03-07/object1.json
s3://some-bucket/data/created_date=2019-03-08/object0.json

I chose the name created_date here; I don't know what would be a good name for your data. You can use just date, but remember to always quote it (and quote it in different ways in DML and DDL: backticks in DDL, double quotes in DML) since it's a reserved word.

Then you create a partitioned table:

CREATE EXTERNAL TABLE my_data (
  column0 string,
  column1 int
)
PARTITIONED BY (created_date date)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://some-bucket/data/'
TBLPROPERTIES ('has_encrypted_data'='false')

Some guides will then tell you to run MSCK REPAIR TABLE to load the partitions for the table. If you use Hive-style partitioning (i.e. …/created_date=2019-03-08/…) you can do this, but it will take a long time and I wouldn't recommend it. You can do a much better job of it by manually adding the partitions, which you do like this:

ALTER TABLE my_data ADD
  PARTITION (created_date = '2019-03-07') LOCATION 's3://some-bucket/data/created_date=2019-03-07/'
  PARTITION (created_date = '2019-03-08') LOCATION 's3://some-bucket/data/created_date=2019-03-08/'
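
If you have more than a handful of dates to register, writing these statements by hand quickly becomes tedious. As a rough sketch (not part of the original answer), you can generate and submit one ALTER TABLE statement covering a whole date range with boto3; the database name, data location, and query result location below are placeholders you would replace with your own:

import datetime
import boto3

athena = boto3.client("athena")

# Placeholders -- replace with your own database, data location, and query result location.
DATABASE = "my_database"
DATA_LOCATION = "s3://some-bucket/data"
RESULT_LOCATION = "s3://some-bucket/athena-results/"

def add_partitions(table, start_date, end_date):
    # Build one PARTITION clause per day in the (inclusive) date range.
    clauses = []
    day = start_date
    while day <= end_date:
        clauses.append(
            f"PARTITION (created_date = '{day.isoformat()}') "
            f"LOCATION '{DATA_LOCATION}/created_date={day.isoformat()}/'"
        )
        day += datetime.timedelta(days=1)
    # Submit a single ALTER TABLE statement that registers all of them.
    query = f"ALTER TABLE {table} ADD IF NOT EXISTS\n" + "\n".join(clauses)
    athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": DATABASE},
        ResultConfiguration={"OutputLocation": RESULT_LOCATION},
    )

add_partitions("my_data", datetime.date(2019, 3, 7), datetime.date(2019, 3, 8))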

Finally, when you query the table, make sure to include the created_date column to give Athena the information it needs to read only the objects that are relevant for the query:

SELECT COUNT(*)
FROM my_data
WHERE created_date >= DATE '2019-03-07'

You can verify that the query will be cheaper by observing the difference in the data scanned when you change from, for example, created_date >= DATE '2019-03-07' to created_date = DATE '2019-03-07'.
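
If you run the queries through the API rather than the console, the scanned bytes are also available programmatically. A minimal sketch, assuming the boto3 Athena client and a query execution ID you already have (the ID below is a placeholder):

import boto3

athena = boto3.client("athena")

# "abc123-example-id" is a placeholder -- use the QueryExecutionId returned by
# start_query_execution or shown in the Athena query history.
response = athena.get_query_execution(QueryExecutionId="abc123-example-id")
stats = response["QueryExecution"]["Statistics"]
print("Data scanned (bytes):", stats["DataScannedInBytes"])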

If you are not able to change the way the objects are organized on S3, there is a poorly documented feature that makes it possible to create a partitioned table even when you can't change the data objects. What you do is create the same prefixes as suggested above, but instead of moving the JSON objects into this structure, you put a file called symlink.txt in each partition's prefix:

s3://some-bucket/data/created_date=2019-03-07/symlink.txt
s3://some-bucket/data/created_date=2019-03-08/symlink.txt

In each symlink.txt you put the full S3 URI of the files that you want to include in that partition. For example, in the first file you could put:

s3://data-bucket/data/object0.json
s3://data-bucket/data/object1.json

and in the second file:

s3://data-bucket/data/object2.json
s3://data-bucket/data/object3.json
s3://data-bucket/data/object4.json
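
Creating these files by hand does not scale to ~70 million objects, so you would typically script it. Here is a rough sketch, assuming the objects sit under a single prefix and that each object's LastModified date is close enough to the timestamp in the JSON (as the question states); all bucket and prefix names are placeholders:

from collections import defaultdict
import boto3

s3 = boto3.client("s3")

# Placeholders -- the bucket/prefix holding the JSON objects, and the bucket/prefix of the table.
DATA_BUCKET = "data-bucket"
DATA_PREFIX = "data/"
TABLE_BUCKET = "some-bucket"
TABLE_PREFIX = "data/"

# Group object URIs by their S3 creation (LastModified) date.
uris_by_date = defaultdict(list)
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=DATA_BUCKET, Prefix=DATA_PREFIX):
    for obj in page.get("Contents", []):
        if not obj["Key"].endswith(".json"):
            continue
        created_date = obj["LastModified"].date().isoformat()
        uris_by_date[created_date].append(f"s3://{DATA_BUCKET}/{obj['Key']}")

# Write one symlink.txt per partition prefix, listing the objects that belong to that partition.
for created_date, uris in uris_by_date.items():
    key = f"{TABLE_PREFIX}created_date={created_date}/symlink.txt"
    s3.put_object(Bucket=TABLE_BUCKET, Key=key, Body="\n".join(uris).encode("utf-8"))

In practice, with ~70 million objects you would probably build the same grouping from the S3 Inventory CSV you already have rather than listing the bucket directly.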

Then you create a table that looks very similar to the table above, but with one small difference:

CREATE EXTERNAL TABLE my_data (
  column0 string,
  column1 int
)
PARTITIONED BY (created_date date)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://some-bucket/data/'
TBLPROPERTIES ('has_encrypted_data'='false')

Note the difference in the INPUTFORMAT property.

You add partitions just like you do for any partitioned table:

ALTER TABLE my_data ADD
  PARTITION (created_date = '2019-03-07') LOCATION 's3://some-bucket/data/created_date=2019-03-07/'
  PARTITION (created_date = '2019-03-08') LOCATION 's3://some-bucket/data/created_date=2019-03-08/'

The only Athena-related documentation of this feature that I have come across is the S3 Inventory docs for integrating with Athena.
