This article covers how to deal with a Glue crawler creating multiple tables from a partitioned S3 bucket; hopefully it is a useful reference for anyone hitting the same problem.

Problem Description

I have an S3 bucket that is structured like this:

root/
├── year=2020/
│   └── month=01
│       ├── day=01
│       │   ├── file1.log
│       │   ├── ...
│       │   └── file8.log
│       ├── day=...
│       └── day=31
│           ├── file1.log
│           ├── ...
│           └── file8.log
└── year=2019/
    └── ...

Each day has 8 files with identical names across days ─ there is a file1.log in every 'day' folder. I crawled this bucket using a custom classifier.
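
For reference, a custom classifier for plain log files is usually a Grok classifier; a minimal boto3 sketch is shown below. The classifier name and the Grok pattern are illustrative placeholders, not taken from the question.

import boto3

glue = boto3.client("glue")

# Register a custom Grok classifier that a crawler can reference by name.
# "my-log-classifier" and the pattern below are placeholders for illustration.
glue.create_classifier(
    GrokClassifier={
        "Name": "my-log-classifier",
        "Classification": "custom-logs",
        "GrokPattern": "%{TIMESTAMP_ISO8601:ts} %{LOGLEVEL:level} %{GREEDYDATA:message}",
    }
)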

Expected behavior: Glue will create one single table with year, month, and day as partition fields, plus the other fields I described in my custom classifier. I can then use that table in my Job scripts.
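
For context, reading such a table from a Glue Job script looks roughly like this minimal PySpark sketch; the database and table names are assumptions:

from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Minimal sketch of reading the crawled table inside a Glue Job.
# "logs_db" and "root" are placeholder database/table names.
glueContext = GlueContext(SparkContext.getOrCreate())

dyf = glueContext.create_dynamic_frame.from_catalog(
    database="logs_db",
    table_name="root",
    push_down_predicate="year='2020' and month='01'",  # prune partitions at read time
)
dyf.printSchema()  # in the scenario described under actual behavior 1), this prints no columns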

Actual behavior:

1) Glue created one table that matched my expectations. However, when I tried to access it in my Job scripts, the table had no columns.

2) Glue created one table for every 'day' partition, and 8 tables for the file<number>.log files.

I have tried excluding **_SUCCESS and **crc, as suggested in this other question: AWS Glue Crawler adding tables for every partition? However, it doesn't seem to work. I have also checked the 'Create a single schema for each S3 path' option in the crawler's settings. It still doesn't work.
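
For completeness, the same exclusion patterns and the single-schema option can also be set when the crawler is defined through the API; a hedged boto3 sketch follows. The crawler, role, database, classifier, and bucket names are placeholders.

import boto3

glue = boto3.client("glue")

# Sketch of a crawler with the exclusions and grouping option mentioned above.
# All names and paths here are placeholders for illustration.
glue.create_crawler(
    Name="logs-crawler",
    Role="AWSGlueServiceRole-logs",
    DatabaseName="logs_db",
    Classifiers=["my-log-classifier"],
    Targets={
        "S3Targets": [
            {
                "Path": "s3://my-bucket/root/",
                "Exclusions": ["**_SUCCESS", "**crc"],
            }
        ]
    },
    # API equivalent of the console option 'Create a single schema for each S3 path'.
    Configuration='{"Version":1.0,"Grouping":{"TableGroupingPolicy":"CombineCompatibleSchemas"}}',
)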

What am I missing?

Recommended Answer

You should have one folder at the root (e.g. customers) and put the partition sub-folders inside it. If the partitions sit directly at the S3 bucket level, the crawler will not create a single table.
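
For illustration, the layout would look something like this, with the crawler pointed at the 'logs' prefix instead of the bucket root (the bucket and prefix names are just examples):

my-bucket/
└── logs/            <- crawler target; becomes a single table
    ├── year=2020/
    │   └── month=01/
    │       └── day=01/
    │           ├── file1.log
    │           └── ...
    └── year=2019/
        └── ...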

That concludes this article on a Glue crawler creating multiple tables from a partitioned S3 bucket; hopefully the recommended answer helps.
