Question
Which is the recommended tool for scheduling Spark jobs on a daily/weekly basis?
1) Oozie
2) Luigi
3) Azkaban
4) Chronos
5) Airflow
Thanks.
Recommended Answer
Updating my previous answer from here: Suggestion for scheduling tool(s) for building hadoop based data pipelines
- Airflow: Try this first. Decent UI, Python-ish job definition, semi-accessible for non-programmers, dependency declaration syntax is weird.
- Airflow has built-in support for the fact that scheduled jobs often need to be rerun and/or backfilled. Make sure you build your pipelines to support this.
- Azkaban enforces simplicity (can’t use features that don’t exist) and the others subtly encourage complexity.
- Check out the Azkaban CLI project for programmatic job creation. https://github.com/mtth/azkaban (examples https://github.com/joeharris76/azkaban_examples)
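To illustrate the Airflow recommendation above, here is a minimal sketch of a DAG that submits a Spark application once a day. It assumes Airflow 2.x with the `apache-airflow-providers-apache-spark` package installed; the DAG id, application path, and connection id are placeholders, not anything from the original answer:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="daily_spark_job",          # hypothetical name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",                 # `schedule_interval` on Airflow < 2.4
    catchup=True,                      # lets Airflow backfill missed runs
) as dag:
    submit = SparkSubmitOperator(
        task_id="run_spark_app",
        application="/path/to/app.py",  # placeholder path to your Spark job
        conn_id="spark_default",
        # Pass the logical run date into the job so a rerun/backfill of a
        # given day recomputes exactly that day's data (idempotent reruns).
        application_args=["{{ ds }}"],
    )
```

With `catchup=True`, clearing a past task instance (or extending `start_date` backwards) makes Airflow re-execute those dates, which is the rerun/backfill behavior mentioned above.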
Philosophy:
Simpler pipelines are better than complex pipelines: easier to create, easier to understand (especially when you didn’t create them), and easier to debug/fix.
When complex actions are needed you want to encapsulate them in a way that either completely succeeds or completely fails.
If you can make it idempotent (running it again creates identical results) then that’s even better.
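A common way to get both properties at once, sketched below with a hypothetical helper (not from the original answer), is the write-temp-then-rename pattern: each run rewrites its date's output partition atomically, so the step either completely succeeds or leaves the previous output untouched, and rerunning it produces identical results:

```python
import json
import os
import tempfile

def write_partition(records, out_dir, ds):
    """Atomically (re)write one day's output partition.

    The final path is keyed by the run date `ds`, so a rerun or
    backfill for the same date overwrites the same file with the
    same content (idempotent) instead of appending duplicates.
    """
    os.makedirs(out_dir, exist_ok=True)
    final_path = os.path.join(out_dir, f"dt={ds}.json")
    # Write to a temp file first, then rename. os.replace is atomic on
    # POSIX filesystems: readers never see a half-written file, and a
    # failed run leaves any previous output intact.
    fd, tmp_path = tempfile.mkstemp(dir=out_dir)
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(records, f, sort_keys=True)
        os.replace(tmp_path, final_path)
    except BaseException:
        os.remove(tmp_path)
        raise
    return final_path
```

The same idea scales up: Spark jobs that overwrite a date-keyed partition (rather than appending to a shared table) are safe to rerun from any scheduler.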