Problem description
We have many Hive queries that take a long time to run. We are using Tez and other good practices such as CBO and ORC files.
Is there a way to check or analyze data skew, for example with some command? Would an explain plan help and, if so, which parameter should I look for?
The explain plan will not help with this; you need to check the data itself. If the query is a join, select the top 100 join key values by row count from every table involved in the join. If it is an analytic function, do the same for the PARTITION BY key. The counts will show whether the data is skewed.
Example:
select key, count(*) cnt
from table_name
group by key
having count(*) > 1000 -- also check > 1 for tables that should have no duplicate keys (such as dimensions)
order by cnt desc
limit 100;
Here key can be a composite join key (all the columns you use in the join ON condition).
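For a composite key, the same check groups by every column in the ON condition. The following is a minimal sketch; the table name orders and the columns customer_id and region_id are hypothetical and stand in for your own join columns.

```sql
-- Hypothetical names: replace orders, customer_id and region_id
-- with your table and the columns from your join ON condition.
select customer_id, region_id, count(*) as cnt
from orders
group by customer_id, region_id
order by cnt desc
limit 100;
```

Run the same query against each table in the join. A few keys with counts far above the rest suggest skew on that side.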
Also have a look at this answer: https://stackoverflow.com/a/51061613/2700344
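Once you have the per-key counts for both sides of a join, one possible next step is to combine them and estimate how many rows each key would produce in the join output. This is only a sketch under assumptions: left_table, right_table and join_key are hypothetical names, and the product of the two counts is an estimate for a plain inner equi-join on that single key.

```sql
-- Hypothetical names: left_table, right_table, join_key.
-- For an inner equi-join, a key with l.cnt rows on one side and
-- r.cnt rows on the other yields l.cnt * r.cnt output rows.
select l.join_key,
       l.cnt as left_cnt,
       r.cnt as right_cnt,
       l.cnt * r.cnt as joined_rows
from (select join_key, count(*) as cnt from left_table group by join_key) l
join (select join_key, count(*) as cnt from right_table group by join_key) r
  on l.join_key = r.join_key
order by joined_rows desc
limit 100;
```

Keys at the top of this list are the ones most likely to overload a single reducer.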