Question
I have an S3 bucket that is structured like this:
root/
├── year=2020/
│   └── month=01/
│       ├── day=01/
│       │   ├── file1.log
│       │   ├── ...
│       │   └── file8.log
│       ├── day=.../
│       └── day=31/
│           ├── file1.log
│           ├── ...
│           └── file8.log
└── year=2019/
    └── ...
Each day has 8 files with identical names across the days: there is a file1.log in every 'day' folder. I crawled this bucket using a custom classifier.
Expected behavior: Glue will create one single table with year, month, and day as partition fields, and several other fields that I described in my custom classifier. I then can use the table in my Job scripts.
Actual behavior:
1) Glue created one table that matched my expectations. However, when I tried to access it in my Job scripts, the table had no columns.
2) Glue created one table for every 'day' partition, and 8 tables for the file&lt;number&gt;.log files.
I have tried excluding **_SUCCESS and **crc, as people suggested on this other question: AWS Glue Crawler adding tables for every partition? However, it doesn't seem to work. I have also checked the 'Create a single schema for each S3 path' option in the crawler's settings. It still doesn't work.
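For reference, both of the workarounds mentioned above can also be set programmatically when defining the crawler. The sketch below builds the keyword arguments for boto3's `glue.create_crawler` call; the crawler name, IAM role ARN, database name, and bucket path are hypothetical placeholders, and the `Configuration` JSON is the API-level equivalent of the 'Create a single schema for each S3 path' checkbox.

```python
import json

# Hypothetical crawler definition; names, role, and paths are placeholders.
crawler_kwargs = {
    "Name": "logs-crawler",                                     # placeholder
    "Role": "arn:aws:iam::123456789012:role/GlueCrawlerRole",   # placeholder
    "DatabaseName": "logs_db",                                  # placeholder
    "Targets": {
        "S3Targets": [
            {
                "Path": "s3://my-bucket/root/",                 # placeholder
                # Skip marker/checksum files so they are not crawled
                "Exclusions": ["**_SUCCESS", "**crc"],
            }
        ]
    },
    # Equivalent of the 'Create a single schema for each S3 path' option
    "Configuration": json.dumps({
        "Version": 1.0,
        "Grouping": {"TableGroupingPolicy": "CombineCompatibleSchemas"},
    }),
}

# To actually create the crawler you would pass these to the Glue client:
#   import boto3
#   boto3.client("glue").create_crawler(**crawler_kwargs)
print(json.dumps(crawler_kwargs, indent=2))
```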
What am I missing?
Answer
You should have one folder at the root (e.g. customers) and the partition sub-folders inside it. If the partitions sit directly at the S3 bucket level, the crawler will not create a single table.
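To see why the extra prefix matters: the crawler treats the top-level folder as the table and the Hive-style `key=value` segments under it as partition columns. A minimal sketch of that mapping (the bucket layout and keys here are made up for illustration):

```python
# Sketch: how Hive-style key=value folder names map to partition columns.
# With a common top-level prefix (here "logs/"), the key=value segments
# below it become partitions of one table rather than separate tables.
def parse_partitions(s3_key: str) -> dict:
    """Extract key=value partition segments from an S3 object key."""
    parts = {}
    for segment in s3_key.split("/"):
        if "=" in segment:
            name, _, value = segment.partition("=")
            parts[name] = value
    return parts

key = "logs/year=2020/month=01/day=01/file1.log"  # table prefix: logs/
print(parse_partitions(key))  # → {'year': '2020', 'month': '01', 'day': '01'}
```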