Is there a way to identify or detect data skew in Hive tables?


Problem description

We have many Hive queries that take a lot of time. We are using Tez and other good practices such as CBO and ORC files.

Is there a way to check or analyze data skew, for example with some command? Would an explain plan help, and if so, which parameter should I look for?

Solution

The explain plan will not help with this; you need to check the data itself. If the query is a join, select the top 100 join key values from every table involved in the join. If it is an analytic function, do the same for the PARTITION BY key. The row counts per key will show whether the data is skewed.

Example:

select key, count(*) as cnt
  from my_table            -- placeholder table name
 group by key
having count(*) > 1000     -- also check > 1 for tables that should have no duplicates (such as dimensions)
 order by cnt desc
 limit 100;

The key can be a composite join key (all the columns you use in the join ON condition).
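
For a composite key, the same check can be run on both sides of the join and the per-key counts multiplied to estimate how many rows each key produces. The following is only a rough sketch: the tables `orders` and `customers` and the columns `customer_id` and `region` are hypothetical names chosen for illustration, not something from the question.

```sql
-- Hypothetical tables: orders(customer_id, region, ...), customers(customer_id, region, ...)
-- Step 1: count rows per composite join key on each side of the join.
with left_keys as (
  select customer_id, region, count(*) as cnt
    from orders
   group by customer_id, region
),
right_keys as (
  select customer_id, region, count(*) as cnt
    from customers
   group by customer_id, region
)
-- Step 2: multiply the two counts to estimate the rows the join emits per key.
select l.customer_id,
       l.region,
       l.cnt         as left_cnt,
       r.cnt         as right_cnt,
       l.cnt * r.cnt as joined_rows
  from left_keys l
  join right_keys r
    on l.customer_id = r.customer_id
   and l.region      = r.region
 order by joined_rows desc
 limit 100;
```

Keys at the top of this result with a `joined_rows` value far larger than the rest are the likely skew candidates. If `right_cnt` is greater than 1 for a table that is meant to be a dimension, that points to unexpected duplicates.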

Also have a look at this answer: https://stackoverflow.com/a/51061613/2700344


08-04 16:05