Problem description
I have CSV files organized by date and time as follows
logs/YYYY/MM/DD/CSV files...
I have set up Apache Drill to execute SQL queries on top of these CSV files. Since there are many CSV files, the organization of the files can be utilized to optimize performance. For example,
SELECT * from data where trans>='20170101' AND trans<'20170102';
In this SQL, the directory logs/2017/01/01 should be scanned for data. Is there a way to let Apache Drill optimize based on this directory structure? Is it possible to do this in Hive, Impala, or any other tool?
Please note:
- SQL queries almost always include a time range.
- The number of CSV files in a given directory is not large, but combined across all years the data is huge.
- Each CSV file has a field named 'trans' that contains the date and time.
- CSV files are placed under the appropriate directory based on the value of the 'trans' field.
- The CSV files do not follow any fixed schema; the columns may or may not be the same across files.
Recommended answer
Querying using a column inside the data file would not help with partition pruning.
You can use the dir* variables in Drill to refer to partitions in the table.
create view trans_logs_view as
select
  `dir0` as `tran_year`,
  `dir1` as `trans_month`,
  `dir2` as `tran_date`,
  *
from dfs.`/data/logs`;
You can query using the tran_year, trans_month, and tran_date columns for partition pruning.
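For example, a pruning query against this view might look like the following sketch (it assumes the view was created in the session's current, writable workspace, and that the directory names are zero-padded strings such as 2017/01/01, matching the logs/YYYY/MM/DD layout):

-- equality filters on the directory-derived columns let Drill prune the scan
-- to logs/2017/01/01 instead of reading every directory
select count(1)
from trans_logs_view
where `tran_year` = '2017'
  and `trans_month` = '01'
  and `tran_date` = '01';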
Also see if the query below helps with pruning.
select count(1) from dfs.`/data/logs`
where concat(`dir0`,`dir1`,`dir2`) between '20170101' AND '20170102';
If so, you can define a view that aliases concat(`dir0`, `dir1`, `dir2`) to the trans column name and query against that view.
See below for more details.
https://drill.apache.org/docs/how-to-partition-data/