sql - Hive: optimizing long-running queries

A simple Hive SQL query on a 50 GB employee log table has been running for hours.

select dept,count(distinct emp_id) from emp_log group by dept;

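There are only 4-5 departments, each with a large number of employees.

It runs on Hive 0.14 + Tez with 1 TB of memory. Is there any way to optimize this query for better performance?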

Edit 1
I tested replacing distinct with collect_list:
SELECT dept, size(collect_list(emp_id)) nb_emps
FROM emp_log
GROUP BY dept
It fails with the following error:

Status: Failed Vertex failed, vertexName=Reducer 2,vertexId=vertex_1446976653619_0043_1_02, diagnostics=[Task failed,taskId=task_1446976653619_0043_1_02_000282, diagnostics=[TaskAttempt 0 failed, info=[Error: Failure while running task:java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space

Best answer

You should try to avoid count(distinct foo):

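SELECT dept, size(collect_set(emp_id)) nb_emps
FROM emp_log
GROUP BY dept

count(distinct x) is inefficient in Hive 0.14. Note that collect_set deduplicates its input, whereas collect_list keeps duplicates, so size(collect_list(emp_id)) would return the number of rows per department rather than the number of distinct employees.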
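If the collect-based query still runs out of memory (each department's entire array has to be built inside a single reducer, and there are only 4-5 departments), a commonly used alternative is to deduplicate first and count afterwards. The following is a minimal sketch, assuming the emp_log table and columns from the question; the subquery alias t is only illustrative, and how much it helps on this cluster is untested:

```sql
-- Stage 1 (inner query): reduce emp_log to one row per (dept, emp_id) pair.
-- Grouping on both columns lets the work spread across many reducers
-- instead of at most one reducer per department.
-- Stage 2 (outer query): count the remaining non-NULL emp_id rows per
-- department, which matches what count(distinct emp_id) would return.
SELECT dept, count(emp_id) AS nb_emps
FROM (
  SELECT dept, emp_id
  FROM emp_log
  GROUP BY dept, emp_id
) t
GROUP BY dept;
```

Unlike the collect-based version, this never materializes a per-department array in memory, so it should not hit the same Java heap space error.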

Also, you should compute statistics for the table and for the columns involved, as follows:

ANALYZE TABLE emp_log COMPUTE STATISTICS;
ANALYZE TABLE emp_log COMPUTE STATISTICS FOR COLUMNS dept, emp_id;
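
Collected statistics only help if the planner is allowed to read them. Below is a sketch of session settings that are often enabled together with ANALYZE on Hive 0.14. Whether they change the plan for this particular query is an assumption, and is best checked with EXPLAIN:

```sql
-- Enable the cost-based optimizer (available from Hive 0.14).
SET hive.cbo.enable=true;
-- Let the planner fetch column statistics from the metastore.
SET hive.stats.fetch.column.stats=true;
-- Answer simple aggregates from stored statistics where possible.
SET hive.compute.query.using.stats=true;

-- Verify that column statistics were actually recorded for dept.
DESCRIBE FORMATTED emp_log dept;

-- Inspect the plan before running the full query.
EXPLAIN
SELECT dept, count(distinct emp_id) FROM emp_log GROUP BY dept;
```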

Regarding sql - Hive: optimizing long-running queries, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/33598226/