airflow 1.10.0
Official site: http://airflow.apache.org/
I Introduction
Airflow is a platform to programmatically author, schedule and monitor workflows.
Use airflow to author workflows as directed acyclic graphs (DAGs) of tasks. The airflow scheduler executes your tasks on an array of workers while following the specified dependencies. Rich command line utilities make performing complex surgeries on DAGs a snap. The rich user interface makes it easy to visualize pipelines running in production, monitor progress, and troubleshoot issues when needed.
When workflows are defined as code, they become more maintainable, versionable, testable, and collaborative.
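Airflow is a platform for authoring, scheduling, and monitoring workflows in Python code; a workflow is a DAG (directed acyclic graph) of tasks.
1 Cluster roles
webserver
The web server runs on gunicorn; the number of concurrent worker processes is set by the workers option in airflow.cfg.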
scheduler
The Airflow scheduler monitors all tasks and all DAGs, and triggers the task instances whose dependencies have been met. Behind the scenes, it spins up a subprocess, which monitors and stays in sync with a folder for all DAG objects it may contain, and periodically (every minute or so) collects DAG parsing results and inspects active tasks to see whether they can be triggered.
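The scheduler's core decision described above, triggering only those task instances whose dependencies have been met, can be sketched in a few lines. This is an illustrative, standard-library-only sketch and not Airflow's implementation; the function name `ready_tasks`, the `upstream` and `states` dictionaries, and the task names are all hypothetical.

```python
from typing import Dict, List, Optional, Set


def ready_tasks(upstream: Dict[str, Set[str]],
                states: Dict[str, Optional[str]]) -> List[str]:
    """Return tasks with no state yet whose upstream tasks all succeeded."""
    ready = []
    for task_id, parents in upstream.items():
        if states.get(task_id) is not None:
            continue  # already running or finished, nothing to trigger
        if all(states.get(parent) == "success" for parent in parents):
            ready.append(task_id)
    return sorted(ready)


# Hypothetical three-step chain: extract -> transform -> load.
upstream = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}

print(ready_tasks(upstream, {}))
print(ready_tasks(upstream, {"extract": "success"}))
```

Calling the function once per polling cycle with the latest states mirrors the "periodically inspects active tasks" behaviour in the quoted paragraph.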
worker
Four executors: SequentialExecutor, LocalExecutor, CeleryExecutor, MesosExecutor:
1)Airflow uses a sqlite database, which you should outgrow fairly quickly since no parallelization is possible using this database backend. It works in conjunction with the SequentialExecutor which will only run task instances sequentially.
2)LocalExecutor, tasks will be executed as subprocesses;
3)CeleryExecutor is one of the ways you can scale out the number of workers. For this to work, you need to setup a Celery backend (RabbitMQ, Redis, …) and change your airflow.cfg to point the executor parameter to CeleryExecutor and provide the related Celery settings.
4)MesosExecutor allows you to schedule airflow tasks on a Mesos cluster.
SequentialExecutor is paired with the sqlite database, LocalExecutor runs tasks as subprocesses, CeleryExecutor depends on a backend (such as RabbitMQ or Redis), and MesosExecutor submits tasks to a Mesos cluster.
2 Concepts
DAG
In Airflow, a DAG – or a Directed Acyclic Graph – is a collection of all the tasks you want to run, organized in a way that reflects their relationships and dependencies.
A DAG is a collection of tasks organized by their dependencies into a directed acyclic graph; it is the equivalent of a workflow.
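To make "directed acyclic graph" concrete, the sketch below derives a valid run order from task dependencies and rejects graphs that contain a cycle. It uses Kahn's algorithm from general graph theory; it is not taken from Airflow's source, and the function name `topological_order` and the sample graph are hypothetical.

```python
from collections import deque


def topological_order(downstream):
    """Kahn's algorithm: return a valid run order, or raise on a cycle.

    `downstream` maps every task to the list of tasks that depend on it;
    every task must appear as a key.
    """
    indegree = {task: 0 for task in downstream}
    for children in downstream.values():
        for child in children:
            indegree[child] += 1

    # Start with tasks that have no upstream dependencies.
    queue = deque(sorted(t for t, d in indegree.items() if d == 0))
    order = []
    while queue:
        task = queue.popleft()
        order.append(task)
        for child in sorted(downstream[task]):
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)

    if len(order) != len(downstream):
        raise ValueError("cycle detected: not a DAG")
    return order


graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
print(topological_order(graph))
```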
Operator
An operator describes a single task in a workflow. Operators are usually (but not always) atomic, meaning they can stand on their own and don’t need to share resources with any other operators. The DAG will make sure that operators run in the correct order; other than those dependencies, operators generally run independently. In fact, they may run on two completely different machines.
An operator describes one task in a workflow. It is an abstract concept, the equivalent of an abstract task definition.
Task
Once an operator is instantiated, it is referred to as a “task”. The instantiation defines specific values when calling the abstract operator, and the parameterized task becomes a node in a DAG.
Once an operator is instantiated (through its constructor) it becomes a task. A task is a concrete concept and forms part of a DAG.
DAG Run
A DAG Run is an object representing an instantiation of the DAG in time.
A DAG run is an instance object of a DAG, the equivalent of a workflow instance.
Task Instance
A task instance represents a specific run of a task and is characterized as the combination of a dag, a task, and a point in time. Task instances also have an indicative state, which could be “running”, “success”, “failed”, “skipped”, “up for retry”, etc.
Each execution of a task produces a task instance, and every task instance has a state such as running, success, or failed.
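The relationship between a task and its task instances can be modeled roughly as below. This is a simplified sketch for illustration only: the `Task` and `TaskInstance` classes are hypothetical stand-ins, not Airflow's real classes, and the ids and dates are made up.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Task:
    """A parameterized operator: one node in one DAG (hypothetical model)."""
    dag_id: str
    task_id: str


@dataclass
class TaskInstance:
    """One run of a task, identified by (dag, task, point in time)."""
    task: Task
    execution_date: datetime
    state: Optional[str] = None  # e.g. "running", "success", "failed"

    def key(self):
        return (self.task.dag_id, self.task.task_id, self.execution_date)


task = Task(dag_id="example_dag", task_id="example_task")

# The same task yields a distinct task instance for every scheduled run.
ti_1 = TaskInstance(task, datetime(2019, 1, 25, 0, 0))
ti_2 = TaskInstance(task, datetime(2019, 1, 25, 0, 5))
ti_1.state = "success"

print(ti_1.key() != ti_2.key(), ti_1.state, ti_2.state)
```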
II Installation
Ambari installation
See: https://www.cnblogs.com/barneywill/p/10284804.html
Docker installation
See: https://www.cnblogs.com/barneywill/p/10397260.html
Manual installation
1 Check Python
2 Install pip
pip is already installed if you are using Python 2 >=2.7.9 or Python 3 >=3.4 downloaded from python.org
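3 Install Airflow
1) If you get the error:
you need to set an environment variable
2) If you get the error:
you need to install the missing package
4 Set environment variables
The default here is /usr/local/airflow (if AIRFLOW_HOME is not set, Airflow itself uses ~/airflow)
5 Verify
$AIRFLOW_HOME/airflow.cfg is created automatically
6 Modify the database configuration
Change the following settings: set sql_alchemy_conn to a MySQL or PostgreSQL connection string, and change executor to LocalExecutor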
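As a rough illustration of which two keys change, the sketch below applies the edit with Python's standard configparser to a tiny in-memory stand-in for airflow.cfg. It assumes both options sit in the [core] section, as they do in Airflow 1.10; the sample file content, database host, user, and password are placeholders, not values from this setup.

```python
import configparser
import io

# A tiny stand-in for the generated airflow.cfg (the real file is much longer).
SAMPLE_CFG = """\
[core]
executor = SequentialExecutor
sql_alchemy_conn = sqlite:////usr/local/airflow/airflow.db
"""

# interpolation=None keeps any '%' characters in values untouched.
cfg = configparser.ConfigParser(interpolation=None)
cfg.read_string(SAMPLE_CFG)

# Placeholder credentials and host: replace with your own database.
cfg["core"]["sql_alchemy_conn"] = "mysql://airflow_user:airflow_pass@db-host:3306/airflow"
cfg["core"]["executor"] = "LocalExecutor"

# Serialize the result to show what the edited section looks like.
out = io.StringIO()
cfg.write(out)
updated = out.getvalue()
print(updated)
```

configparser drops comments when it writes a file back, so for the real airflow.cfg editing the two lines by hand is usually the safer choice.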
7 Initialize the db
8 Common commands
If you get the error
reinstall urllib3
If problems persist, reinstall chardet, idna, and urllib3
III Usage
1 DAG
Example DAG:
from datetime import timedelta, datetime
import airflow
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
from airflow.operators.dummy_operator import DummyOperator

default_args = {
    'owner': 'www',
    'depends_on_past': False,
    'start_date': datetime(2019, 1, 25),
    'email': ['[email protected]'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG(
    'hello_dag',
    default_args=default_args,
    description='hello world DAG',
    schedule_interval='*/5 * * * *'
)

start_operator = DummyOperator(task_id='start_task', dag=dag)

sh_hello_operator = BashOperator(
    task_id='sh_hello_task',
    depends_on_past=False,
    bash_command='echo "hello {{ params.p }} : "`date` >> /tmp/test.txt',
    params={'p':'world'},
    dag=dag
)

def print_hello():
    return 'Hello world!'

py_hello_operator = PythonOperator(
    task_id='py_hello_task',
    python_callable=print_hello,
    dag=dag)

start_operator >> sh_hello_operator
sh_hello_operator >> py_hello_operator
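The example DAG uses the commonly used BashOperator and PythonOperator and defines the dependencies between the tasks.
In the web UI it looks like this: [screenshot]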
An Airflow Python script is really just a configuration file specifying the DAG’s structure as code. The actual tasks defined here will run in a different context from the context of this script. Different tasks run on different workers at different points in time, which means that this script cannot be used to cross communicate between tasks.
People sometimes think of the DAG definition file as a place where they can do some actual data processing - that is not the case at all! The script’s purpose is to define a DAG object. It needs to evaluate quickly (seconds, not minutes) since the scheduler will execute it periodically to reflect the changes if any.
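An Airflow Python script only defines the structure of the DAG. At run time each task executes on a different worker or in a different context, so do not pass variables between tasks in the script or put real business logic in it. The scheduler re-executes the script periodically to refresh the DAG.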
Airflow leverages the power of Jinja Templating and provides the pipeline author with a set of built-in parameters and macros. Airflow also provides hooks for the pipeline author to define their own parameters, macros and templates.
DAG scripts support Jinja templates; for details on Jinja see: http://jinja.pocoo.org/docs/dev/api/
Reference: http://airflow.apache.org/tutorial.html#it-s-a-dag-definition-file
2 Testing DAG and task execution locally
Time to run some tests. First let’s make sure that the pipeline parses. Let’s assume we’re saving the code from the previous step in tutorial.py in the DAGs folder referenced in your airflow.cfg. The default location for your DAGs is ~/airflow/dags.
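Both airflow run and airflow test can execute a task. The difference is that run performs many checks first, for example whether the task's dependencies are met, while test ignores dependencies and does not record state in the database.
After a task runs, its logs are under ~/airflow/logs/$dag_id/$task_id/.
3 Start the servers
Copy the .py file that defines the DAG into $AIRFLOW_HOME/dags/. The scheduler discovers and loads it automatically, and Airflow reloads DAGs from the dags directory periodically. Logs are written under $AIRFLOW_HOME/logs/$dag_id/$task_id/.
Open http://$server_ip:8080/admin/
4 High-availability cluster
In Airflow, multiple web servers and workers can be started, but only one scheduler can run, which makes the scheduler a single point of failure. A third-party open-source project addresses this: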
Airflow Scheduler Failover Controller
Repository: https://github.com/teamclairvoyant/airflow-scheduler-failover-controller
How it works
The Airflow Scheduler Failover Controller (ASFC) is a mechanism that ensures that only one Scheduler instance is running in an Airflow Cluster at a time. This way you don't come across the issues described in the "Motivation" section of the ASFC README.
You will first need to startup the ASFC on each of the instances you want the scheduler to be running on. When you start up multiple instances of the ASFC one of them takes on the Active state and the other takes on a Standby state. There is a heart beat mechanism setup to track if the Active ASFC is still active. If the Active ASFC misses multiple heart beats, the Standby ASFC becomes active.
The Active ASFC will poll every 10 seconds to see if the scheduler is running on the desired node. If it is not, the ASFC will try to restart the daemon. If the scheduler daemon still doesn't start up, the daemon is started on another node in the cluster.
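The active/standby behaviour described above can be modeled with a toy simulation. This is only a sketch of the idea and not ASFC's actual code: the `Controller` class, the `ensure_scheduler` function, and the node names are hypothetical, and the limit of 3 missed heartbeats is an assumption, since the text only says "multiple".

```python
MISSED_BEATS_LIMIT = 3  # assumption: the source only says "multiple" heartbeats


class Controller:
    """Toy model of one failover controller instance (hypothetical)."""

    def __init__(self, name, state):
        self.name = name
        self.state = state  # "ACTIVE" or "STANDBY"
        self.missed = 0     # consecutive heartbeats missed by the active peer

    def observe_peer(self, peer_heartbeat_seen):
        """Standby side: count missed heartbeats, take over past the limit."""
        if self.state != "STANDBY":
            return
        self.missed = 0 if peer_heartbeat_seen else self.missed + 1
        if self.missed >= MISSED_BEATS_LIMIT:
            self.state = "ACTIVE"


def ensure_scheduler(scheduler_running, restart_succeeds, nodes):
    """Active side: return the node the scheduler should end up running on."""
    if scheduler_running:
        return nodes[0]  # already running on the desired node
    if restart_succeeds:
        return nodes[0]  # restarted in place on the desired node
    return nodes[1]      # restart failed: start it on another node


standby = Controller("node-b", "STANDBY")
for seen in [True, False, False, False]:  # the active peer goes silent
    standby.observe_peer(seen)

print(standby.state)
print(ensure_scheduler(False, False, ["node-a", "node-b"]))
```

In a real deployment the poll would run on a timer (every 10 seconds according to the quoted text) rather than being driven by a fixed list.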
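Installation
Error
Inspect
In setup.py, airflow needs to be changed to apache-airflow; after installing, start it
which reports an error
Reinstall Flask-Login
After reinstalling, Flask-Login is at 0.4.1, which satisfies the requirement, but another error follows
So Airflow Scheduler Failover Controller is not compatible with Airflow 1.10.0.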