This article covers how to deal with a Glue crawler creating multiple tables from a partitioned S3 bucket; hopefully it is a useful reference for anyone hitting the same problem.

Problem Description

I have an S3 bucket that is structured like this:

root/
├── year=2020/
│   └── month=01
│       ├── day=01
│       │   ├── file1.log
│       │   ├── ...
│       │   └── file8.log
│       ├── day=...
│       └── day=31
│           ├── file1.log
│           ├── ...
│           └── file8.log
└── year=2019/
    └── ...

Each day has 8 files with identical names across days ─ there is a file1.log in every 'day' folder. I crawled this bucket using a custom classifier.
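
For reference, a custom classifier for plain log files is usually a Grok classifier; a minimal boto3 sketch is shown below. The classifier name and the Grok pattern are illustrative placeholders, not taken from the question.

import boto3

glue = boto3.client("glue")

# Register a custom Grok classifier that a crawler can reference by name.
# "my-log-classifier" and the pattern below are placeholders for illustration.
glue.create_classifier(
    GrokClassifier={
        "Name": "my-log-classifier",
        "Classification": "custom-logs",
        "GrokPattern": "%{TIMESTAMP_ISO8601:ts} %{LOGLEVEL:level} %{GREEDYDATA:message}",
    }
)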

Expected behavior: Glue will create one single table with year, month, and day as partition fields, plus the other fields I described in my custom classifier. I can then use that table in my Job scripts.
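
For context, reading such a table from a Glue Job script looks roughly like this minimal PySpark sketch; the database and table names are assumptions:

from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Minimal sketch of reading the crawled table inside a Glue Job.
# "logs_db" and "root" are placeholder database/table names.
glueContext = GlueContext(SparkContext.getOrCreate())

dyf = glueContext.create_dynamic_frame.from_catalog(
    database="logs_db",
    table_name="root",
    push_down_predicate="year='2020' and month='01'",  # prune partitions at read time
)
dyf.printSchema()  # in the scenario described under actual behavior 1), this prints no columns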

Actual behavior:

1) Glue created one table that matched my expectations. However, when I tried to access it in my Job scripts, the table had no columns.

2) Glue created one table for every 'day' partition, and 8 tables for the file<number>.log files.

I have tried excluding **_SUCCESS and **crc, as suggested in this other question: AWS Glue Crawler adding tables for every partition? However, it doesn't seem to work. I have also checked the 'Create a single schema for each S3 path' option in the crawler's settings. It still doesn't work.
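
For completeness, the same exclusion patterns and the single-schema option can also be set when the crawler is defined through the API; a hedged boto3 sketch follows. The crawler, role, database, classifier, and bucket names are placeholders.

import boto3

glue = boto3.client("glue")

# Sketch of a crawler with the exclusions and grouping option mentioned above.
# All names and paths here are placeholders for illustration.
glue.create_crawler(
    Name="logs-crawler",
    Role="AWSGlueServiceRole-logs",
    DatabaseName="logs_db",
    Classifiers=["my-log-classifier"],
    Targets={
        "S3Targets": [
            {
                "Path": "s3://my-bucket/root/",
                "Exclusions": ["**_SUCCESS", "**crc"],
            }
        ]
    },
    # API equivalent of the console option 'Create a single schema for each S3 path'.
    Configuration='{"Version":1.0,"Grouping":{"TableGroupingPolicy":"CombineCompatibleSchemas"}}',
)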

What am I missing?

Recommended Answer

You should have one folder at the root (e.g. customers) and put the partition sub-folders inside it. If the partitions sit directly at the S3 bucket level, the crawler will not create a single table.
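
For illustration, the layout would look something like this, with the crawler pointed at the 'logs' prefix instead of the bucket root (the bucket and prefix names are just examples):

my-bucket/
└── logs/            <- crawler target; becomes a single table
    ├── year=2020/
    │   └── month=01/
    │       └── day=01/
    │           ├── file1.log
    │           └── ...
    └── year=2019/
        └── ...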

That concludes this article on a Glue crawler creating multiple tables from a partitioned S3 bucket; hopefully the recommended answer helps.
