Question
I am trying to create dynamic DAGs and get them picked up by the scheduler. I followed the reference at https://www.astronomer.io/guides/dynamically-generating-dags/, which works well. I changed it a bit, as in the code below, and need help debugging the issue.
What I tried: 1. Test-running the file. The DAG gets executed and globals() prints all the DAG objects, but somehow they are not listed by list_dags or in the UI.
from datetime import datetime, timedelta
import requests
import json

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.operators.http_operator import SimpleHttpOperator


def create_dag(dag_id,
               dag_number,
               default_args):

    def hello_world_py(*args):
        print('Hello World')
        print('This is DAG: {}'.format(str(dag_number)))

    dag = DAG(dag_id,
              schedule_interval="@hourly",
              default_args=default_args)

    with dag:
        t1 = PythonOperator(
            task_id='hello_world',
            python_callable=hello_world_py,
            dag_number=dag_number)

    return dag


def fetch_new_dags(**kwargs):
    for n in range(1, 10):
        print("=====================START=========\n")
        dag_id = "abcd_" + str(n)
        print(dag_id)
        print("\n")
        globals()[dag_id] = create_dag(dag_id, n, default_args)
    print(globals())


default_args = {
    'owner': 'diablo_admin',
    'depends_on_past': False,
    'start_date': datetime(2019, 8, 8),
    'email': ['[email protected]'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=1),
    'trigger_rule': 'none_skipped'
    #'schedule_interval': '0 * * * *'
    # 'queue': 'bash_queue',
    # 'pool': 'backfill',
    # 'priority_weight': 10,
    # 'end_date': datetime(2016, 1, 1),
}

dag = DAG('testDynDags', default_args=default_args, schedule_interval='*/1 * * * *')
#schedule_interval='*/1 * * * *'

check_for_dags = PythonOperator(dag=dag,
                                task_id='tst_dyn_dag',
                                provide_context=True,
                                python_callable=fetch_new_dags
                                )

check_for_dags
Expected 10 DAGs to be created dynamically and added to the scheduler.
Answer
I guess doing this should fix it:

- completely remove the global testDynDags dag and tst_dyn_dag task (instantiation and invocation)
- invoke your fetch_new_dags(..) method with the requisite arguments in global scope
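A minimal sketch of that restructuring, with a stand-in _DAG class so the pattern runs without Airflow installed (in the real file you would keep the create_dag() factory and the airflow imports from the question):

```python
# The key change: the generation loop runs at module top level (global
# scope), so the scheduler registers the DAGs every time it parses this
# file. _DAG is a hypothetical stand-in for airflow.DAG.

class _DAG:
    def __init__(self, dag_id, **kwargs):
        self.dag_id = dag_id

def create_dag(dag_id, dag_number, default_args):
    # In the real file this builds the DAG and its PythonOperator task,
    # exactly as create_dag() in the question does.
    return _DAG(dag_id, default_args=default_args)

default_args = {'owner': 'diablo_admin'}

# No wrapper testDynDags DAG and no tst_dyn_dag task -- just the loop.
for n in range(1, 10):
    dag_id = "abcd_" + str(n)
    globals()[dag_id] = create_dag(dag_id, n, default_args)

generated_ids = sorted(k for k in globals() if k.startswith("abcd_"))
print(generated_ids)
```

With the loop at global scope, abcd_1 through abcd_9 exist as module-level names at parse time, which is exactly what the scheduler looks for.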
Explanation
- Dynamic dags / tasks merely means that you have well-defined logic at the time of writing the dag-definition file that can help create tasks / dags having a known structure in a pre-defined fashion.
- You can NOT determine the structure of your DAG at runtime (task execution). So, for instance, you cannot add n identical tasks to your DAG if an upstream task returned an integer value n. But you can iterate over a YAML file containing n segments and generate n tasks / dags.
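That config-driven iteration can be sketched as follows; JSON and plain dicts stand in for the YAML file and airflow.DAG objects so the snippet runs without extra dependencies, and the segment names are hypothetical:

```python
import json

# Hypothetical config; in the answer's example this would be a YAML file
# maintained outside the dag-definition file.
config_text = json.dumps({"segments": [
    {"dag_id": "ingest_orders"},
    {"dag_id": "ingest_users"},
    {"dag_id": "ingest_events"},
]})

def build_dag(dag_id):
    # Stand-in for a create_dag()-style factory; in the real file this
    # would return an airflow.DAG built from the segment's fields.
    return {"dag_id": dag_id, "schedule_interval": "@hourly"}

# One DAG per segment: the structure is fully known at parse time, before
# any task executes.
for segment in json.loads(config_text)["segments"]:
    globals()[segment["dag_id"]] = build_dag(segment["dag_id"])

generated = sorted(k for k in globals() if k.startswith("ingest_"))
print(generated)
```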
So clearly, wrapping dag-generation code inside an Airflow task itself makes no sense.
UPDATE-1
From what is indicated in the comments, I infer that the requirement dictates that you revise the external source that feeds inputs (how many dags or tasks to create) to your DAG / task-generation script. While this is indeed a complex use-case, a simple way to achieve it is to create 2 separate DAGs:
- one dag that runs from time to time and generates the inputs, storing them in an external resource such as an Airflow Variable (or any other external store like a file / S3 / database, etc.)
- a second DAG that is constructed programmatically by reading that same datasource written by the first DAG
You can take inspiration from the Adding DAGs based on Variable value section.
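The two-DAG pattern can be sketched like this; a JSON file stands in for Airflow's Variable store (in Airflow you would call Variable.set / Variable.get instead), and the key name and counts are hypothetical:

```python
import json
import os
import tempfile

# Hypothetical external store shared by the two DAGs.
store = os.path.join(tempfile.mkdtemp(), "variables.json")

def variable_set(key, value):
    # Stand-in for airflow.models.Variable.set(key, value).
    with open(store, "w") as f:
        json.dump({key: value}, f)

def variable_get(key, default):
    # Stand-in for airflow.models.Variable.get(key, default_var=default).
    try:
        with open(store) as f:
            return json.load(f)[key]
    except (OSError, KeyError):
        return default

# DAG 1: runs on its own schedule; one of its tasks computes how many
# DAGs are needed and persists that number in the external store.
def compute_required_dags():
    variable_set("dag_count", 3)

compute_required_dags()  # in Airflow, the scheduler runs this as a task

# DAG 2's definition file: read the store at parse time (module top
# level) and build one DAG per entry -- structure fixed before execution.
dag_count = int(variable_get("dag_count", 0))
generated = ["abcd_%d" % n for n in range(1, dag_count + 1)]
print(generated)
```

The second file is re-parsed by the scheduler on its usual interval, so it picks up a changed count the next time it is parsed, not mid-run.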