Problem Description
I am able to run a Spark job using BashOperator, but I want to use SparkSubmitOperator for it in Spark standalone mode.
Here is my DAG for the SparkSubmitOperator and the stack trace:
from datetime import datetime

from airflow import DAG
# Airflow 1.x contrib import path for the operator
from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator

args = {
    'owner': 'airflow',
    'start_date': datetime(2018, 5, 24)
}

dag = DAG('spark_job', default_args=args, schedule_interval="*/10 * * * *")

operator = SparkSubmitOperator(
    task_id='spark_submit_job',
    application='/home/ubuntu/test.py',
    total_executor_cores='1',
    executor_cores='1',
    executor_memory='2g',
    num_executors='1',
    name='airflow-spark',
    verbose=False,
    driver_memory='1g',
    conf={'master': 'spark://xx.xx.xx.xx:7077'},
    dag=dag,
)
Looking at the source for spark_submit_hook, it seems that _resolve_connection() always sets master=yarn. How can I change the master property to the Spark standalone master URL? Which properties can I set to run a Spark job in standalone mode?
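For reference, the resolution logic in SparkSubmitHook._resolve_connection() behaves roughly like the simplified paraphrase below (an illustrative sketch, not the exact Airflow source): yarn is only the fallback used when the connection lookup fails; otherwise the master is built from the connection's host and port.

# Simplified paraphrase of the master resolution in SparkSubmitHook
# (illustrative sketch only, not the exact Airflow source).
def resolve_master(get_connection, conn_id='spark_default'):
    master = 'yarn'  # fallback used only when the connection cannot be found
    try:
        conn = get_connection(conn_id)  # Airflow Connection object
        master = conn.host              # e.g. 'spark://xx.xx.xx.xx'
        if conn.port:
            master = '{}:{}'.format(conn.host, conn.port)  # append port, e.g. ':7077'
    except Exception:
        pass  # no usable connection -> stay on the 'yarn' default
    return master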
Recommended Answer
You can either create a new connection using the Airflow Web UI or change the spark_default connection.
Master can be local, yarn, spark://HOST:PORT, mesos://HOST:PORT or k8s://https://<HOST>:<PORT>.
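As a hedged sketch of the first option (assuming the spark_default connection id and reusing the placeholder master address from the question), the connection can be created under Admin -> Connections in the Web UI, or programmatically through the Connection model:

# Sketch: register a Spark connection that points at the standalone master.
# This mirrors what you would enter under Admin -> Connections in the Web UI.
from airflow import settings
from airflow.models import Connection

spark_conn = Connection(
    conn_id='spark_default',     # the id the operator will look up
    conn_type='spark',
    host='spark://xx.xx.xx.xx',  # standalone master host (placeholder from the question)
    port=7077,                   # standalone master port
)

session = settings.Session()
session.add(spark_conn)          # or update the existing spark_default row instead
session.commit()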
You can also supply the following options in the Extra field of the connection:
{"queue": "root.default", "deploy_mode": "cluster", "spark_home": "", "spark_binary": "spark-submit", "namespace": "default"}
Either the spark-submit binary should be on the PATH, or spark-home should be set in the Extra of the connection.
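With such a connection in place, the operator from the question no longer needs the master inside conf; a hedged sketch follows (conn_id is the SparkSubmitOperator parameter, spark_default is the assumed connection id, and dag is the DAG defined in the question):

# Sketch: let the operator resolve the master from the Airflow connection
# instead of passing it through conf.
operator = SparkSubmitOperator(
    task_id='spark_submit_job',
    application='/home/ubuntu/test.py',
    conn_id='spark_default',   # connection whose host/port hold spark://xx.xx.xx.xx:7077
    total_executor_cores='1',
    executor_cores='1',
    executor_memory='2g',
    num_executors='1',
    name='airflow-spark',
    verbose=False,
    driver_memory='1g',
    dag=dag,
)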