Problem description
I have CSV files organized by date and time as follows
logs/YYYY/MM/DD/CSV files...
I have set up Apache Drill to execute SQL queries on top of these CSV files. Since there are many CSV files, the organization of the files can be utilized to optimize performance. For example,
SELECT * from data where trans>='20170101' AND trans<'20170102';
In this SQL, the directory logs/2017/01/01 should be scanned for data. Is there a way to let Apache Drill optimize based on this directory structure? Is it possible to do this in Hive, Impala, or any other tool?
Please note:
- SQL queries almost always include a time range.
- The number of CSV files in a given directory is not large, but combined across all years the data is huge.
- Each CSV file has a field named 'trans' that contains the date and time.
- CSV files are placed under the appropriate directory based on the value of the 'trans' field.
- The CSV files do not follow any fixed schema; the columns may or may not be the same across files.
Recommended answer
Querying using a column inside the data file would not help with partition pruning.
You can use the dir* variables in Drill to refer to partitions in the table.
create view trans_logs_view as
select
  `dir0` as `tran_year`,
  `dir1` as `trans_month`,
  `dir2` as `tran_date`,
  *
from dfs.`/data/logs`;
You can query using the tran_year, trans_month, and tran_date columns for partition pruning.
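For example, a pruning query against this view might look like the following sketch (it assumes the view was created in the session's current, writable workspace, and that the directory names are zero-padded strings such as 2017/01/01, matching the logs/YYYY/MM/DD layout):

-- equality filters on the directory-derived columns let Drill prune the scan
-- to logs/2017/01/01 instead of reading every directory
select count(1)
from trans_logs_view
where `tran_year` = '2017'
  and `trans_month` = '01'
  and `tran_date` = '01';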
Also see if the query below helps with pruning.
select count(1) from dfs.`/data/logs`
where concat(`dir0`,`dir1`,`dir2`) between '20170101' AND '20170102';
If so, you can define a view that aliases concat(`dir0`, `dir1`, `dir2`) to the trans column name and query against that view.
See below for more details.
https://drill.apache.org/docs/how-to-partition-data/