sql - Hive SQL coding style: intermediate tables?

Should I create and drop intermediate tables in Hive?

I can write something like this (much simplified):

drop table if exists tmp1;
create table tmp1 as
select a, b, c
from input1
where a > 1 and b < 3;

drop table if exists tmp2;
create table tmp2 as
select x, y, z
from input2
where x < 6;

drop table if exists output;
create table output as
select x, a, count(*) as count
from tmp1 join tmp2 on tmp1.c = tmp2.z
group by x, a;
drop table tmp1;
drop table tmp2;

Or I can roll everything into a single statement:

drop table if exists output;
create table output as
select x, a, count(*) as count
from (select a, b, c
    from input1
    where a > 1 and b < 3) t1
join (select x, y, z
    from input2
    where x < 6) t2
on t1.c = t2.z
group by x, a;

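Obviously, if I use the intermediate tables more than once, it makes perfect sense to create them.
However, when they are used only once, I have a choice.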

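I tried both. The second approach is 6% faster as measured by wall-clock time, but 4% slower as measured by the MapReduce Total cumulative CPU time reported in the log output.
This difference is probably within the random margin of error (caused by other processes, etc.).
However, is it possible that combining the queries could result in a dramatic speedup?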

Another question: are intermediate tables that are used only once a normal occurrence in Hive code, or should they be avoided when possible?

Best answer

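There is one significant difference.
Running one big query gives the optimizer more freedom in how it optimizes.
One of the most important optimizations in this situation is parallelism, controlled by hive.exec.parallel. When it is set to true, Hive executes independent stages in parallel.
In your case, imagine that t1 and t2 in the second query do more complex work, such as a group by. In the second query, t1 and t2 would then execute simultaneously, whereas in the first script they would run serially.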
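As a minimal sketch of the point above, the single-statement version could be run with stage parallelism switched on. The group by inside each subquery is a hypothetical addition (it is not in the question) that stands in for the "more complex work" the answer imagines, and the outer grouping key is written as x, a so that the statement is valid as it stands. Whether the two subqueries really end up as separate, concurrently running stages depends on the plan Hive produces; prefixing the select with explain shows the stage dependencies.

```sql
-- Let Hive run independent stages of a single query concurrently.
set hive.exec.parallel=true;
-- Optional: upper bound on stages running at the same time (Hive's default is 8).
set hive.exec.parallel.thread.number=8;

drop table if exists output;
create table output as
select t2.x, t1.a, count(*) as cnt
from (select a, c              -- hypothetical aggregation, stands in for heavier work
    from input1
    where a > 1 and b < 3
    group by a, c) t1
join (select x, z              -- hypothetical aggregation, independent of t1
    from input2
    where x < 6
    group by x, z) t2
on t1.c = t2.z
group by t2.x, t1.a;
```

With the intermediate-table script, each create table statement finishes before the next one starts, so this setting cannot overlap the work for tmp1 and tmp2.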

Source: "Hive SQL coding style: intermediate tables?" on Stack Overflow: https://stackoverflow.com/questions/20957799/