本文介绍了雅典娜:最小化包括JOIN操作在内的查询扫描的数据的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

让Athena中有一个外部表,该表指向s3上以镶木地板格式存储的大量数据。它包含许多列,并在名为 timeid的字段上进行分区。现在,还有另一个外部表(小表)将timeid映射到日期。

Let there be an external table in Athena which points to a large amount of data stored in parquet format on s3. It contains a lot of columns and is partitioned on a field called 'timeid'. Now, there's another external table (small one) which maps timeid to date.

当较小的表也在timeid上分区时,我们将它们加入其分区ID(timeid)中)并将date放入where子句,则仅从包含对应于该日期的timeid的大表中扫描那些特定记录。

When the smaller table is also partitioned on timeid and we join them on their partition id (timeid) and put date into where clause, only those specific records are scanned from large table which contain timeids corresponding to that date. The entire data is not scanned here.

但是,如果较小的表未在timeid上进行分区,则即使在date列中存在条件时,也会进行完整的数据扫描。

However, if the smaller table is not partitioned on timeid, full data scan takes place even in the presence of condition on date column.

即使将大分区表与未分区的小表连接在一起,也可以避免完整数据扫描吗?这是必需的,因为小表每个timeid仅包含一个记录,并且可能不希望为每个timeid创建一个单独的文件。

Is there a way to avoid full data scan even when the large partitioned table is joined with an unpartitioned small table? This is required because the small table contains only one record per timeid and it might not be expected to create a separate file for each.

推荐答案

这是一个有趣的发现!

通过使用子查询而不是联接,您可以避免进行大扫描

You might be able to avoid the large scan by using a sub-query instead of a join.

而不是:

SELECT ...
FROM large-table
JOIN small-table
WHERE small-table.date > '2017-08-03'

您可能可以使用:

SELECT ...
FROM large-table
WHERE large-table.date IN
         (SELECT date from small-table
          WHERE date > '2017-08-03')

我还没有对它进行了测试,但这可以避免您提到的JOIN。

I haven't tested it, but that would avoid the JOIN you mention.

这篇关于雅典娜:最小化包括JOIN操作在内的查询扫描的数据的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!

09-26 22:16