Question

I have an Athena table with a partition for each day, where the actual files are in "sub-directories" by hour, as follows:

s3://my-bucket/data/2019/06/27/00/00001.json
s3://my-bucket/data/2019/06/27/00/00002.json
s3://my-bucket/data/2019/06/27/01/00001.json
s3://my-bucket/data/2019/06/27/01/00002.json

Athena is able to query this table without issue and find my data, but when using AWS Glue, it does not appear to be able to find this data.

ALTER TABLE mytable ADD 
PARTITION (year=2019, month=06, day=27) LOCATION 's3://my-bucket/data/2019/06/27/01';

select day, count(*)
from mytable
group by day;

day    count
27     145431

I've already tried changing the location of the partition to end with a trailing slash (s3://my-bucket/data/2019/06/27/01/), but this didn't help.

Below are the partition properties in Glue. I was hoping that the storedAsSubDirectories setting would tell it to iterate the sub-directories, but this does not appear to be the case:

{
    "StorageDescriptor": {
        "cols": {
            "FieldSchema": [
                {
                    "name": "userid",
                    "type": "string",
                    "comment": ""
                },
                {
                    "name": "labels",
                    "type": "array<string>",
                    "comment": ""
                }
            ]
        },
        "location": "s3://my-bucket/data/2019/06/27/01/",
        "inputFormat": "org.apache.hadoop.mapred.TextInputFormat",
        "outputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
        "compressed": "false",
        "numBuckets": "0",
        "SerDeInfo": {
            "name": "JsonSerDe",
            "serializationLib": "org.openx.data.jsonserde.JsonSerDe",
            "parameters": {
                "serialization.format": "1"
            }
        },
        "bucketCols": [],
        "sortCols": [],
        "parameters": {},
        "SkewedInfo": {
            "skewedColNames": [],
            "skewedColValues": [],
            "skewedColValueLocationMaps": {}
        },
        "storedAsSubDirectories": "true"
    },
    "parameters": {}
}

When Glue runs against this same partition/table, it finds 0 rows.

However, if all the data files appear in the root "directory" of the partition (i.e. s3://my-bucket/data/2019/06/27/00001.json), then both Athena and Glue can find the data.
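
One way to picture the behaviour described above is a non-recursive versus a recursive listing under the partition location. The sketch below is only an illustration in plain Python, not how Glue or Athena are actually implemented; the helper name `list_partition_files` and the in-memory key list are hypothetical and simply mirror the layout from the question.

```python
from typing import List

# Hypothetical object keys, mirroring the layout in the question.
KEYS = [
    "data/2019/06/27/00/00001.json",
    "data/2019/06/27/00/00002.json",
    "data/2019/06/27/01/00001.json",
    "data/2019/06/27/01/00002.json",
]


def list_partition_files(keys: List[str], prefix: str, recurse: bool) -> List[str]:
    """Return the keys that a reader would pick up under `prefix`.

    With recurse=False only keys directly under the prefix count;
    with recurse=True keys in nested "sub-directories" count as well.
    """
    if not prefix.endswith("/"):
        prefix += "/"
    found = []
    for key in keys:
        if not key.startswith(prefix):
            continue
        remainder = key[len(prefix):]
        # A "/" in the remainder means the key sits in a nested sub-directory.
        if recurse or "/" not in remainder:
            found.append(key)
    return found


flat = list_partition_files(KEYS, "data/2019/06/27", recurse=False)
nested = list_partition_files(KEYS, "data/2019/06/27", recurse=True)
print(len(flat), len(nested))  # prints: 0 4
```

With the day-level prefix, the non-recursive listing finds nothing because every file sits one level deeper, which matches the "0 rows" symptom.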

Is there some reason why Glue is unable to find the data files? I'd prefer not to create a partition for each hour, since that would mean roughly 8,760 partitions per year (24 × 365), and Athena has a limit of 20,000 partitions per table.
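
The partition arithmetic behind that concern can be checked quickly. This is a small sketch, assuming a non-leap year and taking the 20,000-partition limit as quoted in the question:

```python
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365        # assumption: non-leap year
PARTITION_LIMIT = 20_000   # per-table limit as quoted in the question

# One partition per hour versus one partition per day.
hourly_partitions_per_year = HOURS_PER_DAY * DAYS_PER_YEAR
daily_partitions_per_year = DAYS_PER_YEAR

# How long hourly partitioning would take to reach the quoted limit.
years_until_limit = PARTITION_LIMIT / hourly_partitions_per_year

print(hourly_partitions_per_year)   # 8760
print(round(years_until_limit, 2))  # 2.28
```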

Recommended answer

Apparently there's an undocumented additional option, "recurse", on create_dynamic_frame: additional_options = {"recurse": True}

Example:

athena_datasource = glueContext.create_dynamic_frame.from_catalog(
    database=target_database,
    table_name=target_table,
    push_down_predicate="(year=='2019' and month=='06' and day=='27')",
    transformation_ctx="athena_datasource",
    additional_options={"recurse": True},
)

I have just tested my Glue job with this option and can confirm that it now finds all s3 files.
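
If the same read is needed for several days, the arguments can be assembled in one place. The following is a minimal sketch, not part of the original answer: `build_catalog_read_kwargs` is a hypothetical helper, the database name `my_database` is a placeholder, and the helper only builds a dict, so it runs outside a Glue job. It assumes the partition columns are strings named year, month and day, as in the predicate from the answer.

```python
from typing import Any, Dict


def build_catalog_read_kwargs(database: str, table_name: str,
                              year: str, month: str, day: str) -> Dict[str, Any]:
    """Assemble keyword arguments for create_dynamic_frame.from_catalog.

    Hypothetical helper: it only builds a dict, it does not call Glue.
    """
    predicate = f"(year=='{year}' and month=='{month}' and day=='{day}')"
    return {
        "database": database,
        "table_name": table_name,
        "push_down_predicate": predicate,
        "transformation_ctx": "athena_datasource",
        # The option from the answer: also read files in nested sub-directories.
        "additional_options": {"recurse": True},
    }


kwargs = build_catalog_read_kwargs("my_database", "mytable", "2019", "06", "27")
print(kwargs["push_down_predicate"])

# Inside a Glue job, where glueContext already exists, the call would be:
# athena_datasource = glueContext.create_dynamic_frame.from_catalog(**kwargs)
```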
