Unable to run Spark job using SparkSubmitOperator

This article explains how to resolve a failure to run Spark jobs with SparkSubmitOperator in Airflow; hopefully the answer below is a useful reference.

Problem description


I am able to run a Spark job using BashOperator, but I want to use SparkSubmitOperator for it in Spark standalone mode.


Here's my DAG for the SparkSubmitOperator:

from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator

args = {
    'owner': 'airflow',
    'start_date': datetime(2018, 5, 24)
}
dag = DAG('spark_job', default_args=args, schedule_interval="*/10 * * * *")

operator = SparkSubmitOperator(
    task_id='spark_submit_job',
    application='/home/ubuntu/test.py',
    total_executor_cores='1',
    executor_cores='1',
    executor_memory='2g',
    num_executors='1',
    name='airflow-spark',
    verbose=False,
    driver_memory='1g',
    # Intended to select the standalone master, but it has no effect here:
    # the hook takes --master from the Airflow connection instead.
    conf={'master': 'spark://xx.xx.xx.xx:7077'},
    dag=dag,
)


Looking at the source for spark_submit_hook, it seems _resolve_connection() always sets master=yarn. How can I change the master property's value to a Spark standalone master URL? Which properties can I set to run a Spark job in standalone mode?
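
For context, the fallback being described looks roughly like this (a simplified paraphrase of SparkSubmitHook._resolve_connection() in Airflow 1.x, not the verbatim source): yarn is only the default, and the real master is read from the host and port fields of the Airflow connection.

# Simplified paraphrase of SparkSubmitHook._resolve_connection() (Airflow 1.x);
# not the verbatim source. 'yarn' is only a fallback: the actual master comes
# from the connection identified by conn_id (default: 'spark_default').
def _resolve_connection(self):
    conn_data = {'master': 'yarn'}  # fallback when no usable connection is found
    try:
        conn = self.get_connection(self._conn_id)
        if conn.port:
            conn_data['master'] = '{}:{}'.format(conn.host, conn.port)
        else:
            conn_data['master'] = conn.host
    except Exception:
        pass  # keep the 'yarn' fallback
    return conn_data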

Solution

You can either create a new connection using the Airflow Web UI or change the spark_default connection (the default conn_id used by SparkSubmitOperator).
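
Besides the Web UI, the connection can also be created programmatically against the metadata database. A minimal sketch, assuming Airflow 1.x and a hypothetical connection ID spark_standalone (any ID works as long as the operator's conn_id matches):

from airflow import settings
from airflow.models import Connection

# 'spark_standalone' is a hypothetical ID chosen for this example.
conn = Connection(
    conn_id='spark_standalone',
    conn_type='spark',
    host='spark://xx.xx.xx.xx',  # standalone master host, scheme included
    port=7077,                   # standalone master port
)

session = settings.Session()  # direct session on the metadata DB; error handling omitted
session.add(conn)
session.commit()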

Master can be local, yarn, spark://HOST:PORT, mesos://HOST:PORT, or k8s://https://<HOST>:<PORT>.

You can also supply the following options in the connection's Extra field:

{"queue": "root.default", "deploy_mode": "cluster", "spark_home": "", "spark_binary": "spark-submit", "namespace": "default"}

Either the "spark-submit" binary should be on the PATH, or spark_home must be set in the Extra field of the connection.
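
Putting it together: once such a connection exists, point the operator at it through conn_id rather than trying to set the master via conf. A minimal sketch reusing the hypothetical spark_standalone connection from above:

operator = SparkSubmitOperator(
    task_id='spark_submit_job',
    application='/home/ubuntu/test.py',
    conn_id='spark_standalone',  # resolves --master spark://xx.xx.xx.xx:7077 from the connection
    total_executor_cores=1,
    executor_cores=1,
    executor_memory='2g',
    num_executors=1,
    name='airflow-spark',
    driver_memory='1g',
    dag=dag,
)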
