Problem description
I need to hook a custom execution hook into Apache Hive. Please let me know if somebody knows how to do it.
The current environment I am using is given below:
Hadoop: Cloudera version 4.1.2
Operating system: CentOS
Thanks, Arun
Recommended answer
There are several types of hooks depending on at which stage you want to inject your custom code:
- Driver run hooks (pre/post)
- Semantic analyzer hooks (pre/post)
- Execution hooks (pre/failure/post)
- Client statistics publishers
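For orientation, these hook types correspond to the following interfaces (paraphrased here from the Hive 0.11 source tree, each in its own file; double-check package names and signatures against the version you actually run):

    // org.apache.hadoop.hive.ql -- driver run hooks
    public interface HiveDriverRunHook extends Hook {
      void preDriverRun(HiveDriverRunHookContext hookContext) throws Exception;
      void postDriverRun(HiveDriverRunHookContext hookContext) throws Exception;
    }

    // org.apache.hadoop.hive.ql.parse -- semantic analyzer hooks (usually implemented by extending AbstractSemanticAnalyzerHook)
    public interface HiveSemanticAnalyzerHook extends Hook {
      ASTNode preAnalyze(HiveSemanticAnalyzerHookContext context, ASTNode ast) throws SemanticException;
      void postAnalyze(HiveSemanticAnalyzerHookContext context,
          List<Task<? extends Serializable>> rootTasks) throws SemanticException;
    }

    // org.apache.hadoop.hive.ql.hooks -- pre/failure/post execution hooks
    public interface ExecuteWithHookContext extends Hook {
      void run(HookContext hookContext) throws Exception;
    }

    // org.apache.hadoop.hive.ql.stats -- client statistics publishers
    public interface ClientStatsPublisher {
      void run(Map<String, Double> counterValues, String jobID);
    }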
If you run a script, the processing flow looks as follows:
1. Driver.run() takes the command
2. HiveDriverRunHook.preDriverRun() (HiveConf.ConfVars.HIVE_DRIVER_RUN_HOOKS)
3. Driver.compile() starts processing the command: creates the abstract syntax tree
4. AbstractSemanticAnalyzerHook.preAnalyze() (HiveConf.ConfVars.SEMANTIC_ANALYZER_HOOK)
5. Semantic analysis
6. AbstractSemanticAnalyzerHook.postAnalyze() (HiveConf.ConfVars.SEMANTIC_ANALYZER_HOOK)
7. Create and validate the query plan (physical plan)
8. Driver.execute(): ready to run the jobs
9. ExecuteWithHookContext.run() (HiveConf.ConfVars.PREEXECHOOKS)
10. ExecDriver.execute() runs all the jobs
    - For each job, at every HiveConf.ConfVars.HIVECOUNTERSPULLINTERVAL interval, ClientStatsPublisher.run() is called to publish statistics (HiveConf.ConfVars.CLIENTSTATSPUBLISHERS)
    - If a task fails: ExecuteWithHookContext.run() (HiveConf.ConfVars.ONFAILUREHOOKS)
11. Finish all the tasks
12. ExecuteWithHookContext.run() (HiveConf.ConfVars.POSTEXECHOOKS)
13. Before returning the result: HiveDriverRunHook.postDriverRun() (HiveConf.ConfVars.HIVE_DRIVER_RUN_HOOKS)
14. Return the result.
For each of the hooks I indicated the interfaces you have to implement. In the brackets there's the corresponding conf. prop. key you have to set in order to register the class at the beginning of the script. E.g.: setting the PreExecution hook (9th stage of the workflow):
HiveConf.ConfVars.PREEXECHOOKS -> hive.exec.pre.hooks :
set hive.exec.pre.hooks=com.example.MyPreHook;
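To make this concrete, here is a minimal sketch of what such a pre-execution hook class could look like (the class name com.example.MyPreHook just mirrors the registration example above and is otherwise made up; it assumes the hive-exec jar of your Hive version is on the compile classpath):

    package com.example;

    import org.apache.hadoop.hive.ql.hooks.ExecuteWithHookContext;
    import org.apache.hadoop.hive.ql.hooks.HookContext;

    // Minimal pre-execution hook: runs right before the jobs are launched
    // (stage 9 above) and simply prints some information about the query.
    public class MyPreHook implements ExecuteWithHookContext {

      @Override
      public void run(HookContext hookContext) throws Exception {
        // HookContext gives access to the query plan, the read/write entities
        // and the HiveConf of the current query.
        System.out.println("Query id : " + hookContext.getQueryPlan().getQueryId());
        System.out.println("Inputs   : " + hookContext.getInputs());
        System.out.println("Outputs  : " + hookContext.getOutputs());
      }
    }

The compiled class has to be on Hive's classpath before the set statement is executed, e.g. by packaging it into a jar and loading it with add jar /path/to/my-hooks.jar; (or via hive.aux.jars.path).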
Unfortunately these features aren't really documented, but you can always look into the Driver class to see the evaluation order of the hooks.
Remark: I assumed Hive 0.11.0 here; I don't think the Cloudera distribution differs (too much).