How to submit a Spark job to an EMR cluster through Airflow

Problem

How can I establish a connection between an EMR master cluster (created by Terraform) and Airflow?
I have Airflow set up on an AWS EC2 server in the same security group, VPC, and subnet as the cluster. I need a solution so that Airflow can talk to EMR and execute a spark-submit.

https://aws.amazon.com/blogs/big-data/build-a-concurrent-data-orchestration-pipeline-using-amazon-emr-and-apache-livy/

This blog covers execution after the connection has been established, which didn't help much. In Airflow I have created connections for AWS and EMR via the UI.

Below is the code that lists the EMR clusters that are active or terminated; I can also fine-tune it to get only active clusters:

```python
from airflow.contrib.hooks.aws_hook import AwsHook

hook = AwsHook(aws_conn_id='aws_default')
client = hook.get_client_type('emr', 'eu-central-1')

response = client.list_clusters()
for cluster in response['Clusters']:
    print(cluster['Status']['State'], cluster['Name'])
```

My question is: how can I update the code above to perform spark-submit actions?

Solution

While it may not directly address your particular query, broadly, here are some ways you can trigger spark-submit on a (remote) EMR cluster via Airflow:

Use Apache Livy
- This solution is actually independent of the remote server, i.e., EMR.
- Here's an example.
- The downside is that Livy is in its early stages, and its API appears incomplete and wonky to me.

Use the EmrSteps API
- Dependent on the remote system: EMR.
- Robust, but since it is inherently async, you will also need an EmrStepSensor (alongside EmrAddStepsOperator).
- On a single EMR cluster, you cannot have more than one step running simultaneously (although some hacky workarounds exist).

Use SSHHook / SSHOperator
- Again independent of the remote system.
- Comparatively easier to get started with.
- If your spark-submit command involves a lot of arguments, building that command (programmatically) can become cumbersome.

EDIT-1

There seems to be another straightforward way: specifying the remote master's IP.
- Independent of the remote system.
- Needs modifications to global configurations / environment variables.
- See @cricket_007's answer for details.

Useful links
- This one is from @Kaxil Naik himself: Is there a way to submit spark job on different server running master
- Spark job submission using Airflow by submitting batch POST method on Livy and tracking job
- Remote spark-submit to YARN running on EMR
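As a side note, the cluster listing in the question can be narrowed to active clusters. This is a minimal sketch: the helper below filters a `list_clusters()` response locally (the `Clusters`, `Name`, and `Status`/`State` fields are the real boto3 response shape), and the comment shows the server-side alternative using the `ClusterStates` parameter.

```python
# States in which an EMR cluster is still alive (per the EMR API docs).
ACTIVE_STATES = {'STARTING', 'BOOTSTRAPPING', 'RUNNING', 'WAITING'}

def active_clusters(response):
    """Return (name, state) pairs for clusters in an active state,
    given a boto3 list_clusters() response dict."""
    return [(c['Name'], c['Status']['State'])
            for c in response.get('Clusters', [])
            if c['Status']['State'] in ACTIVE_STATES]

# Against a live client you can instead push the filter server-side:
#   client.list_clusters(ClusterStates=list(ACTIVE_STATES))
```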
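The Livy route above can be sketched as a plain HTTP call: POST a batch to the Livy server on the EMR master (port 8998 by default) and poll its state. The helper below builds the JSON body for Livy's `/batches` endpoint (`file`, `args`, and `conf` are real fields of that API); the host, bucket, and file names are placeholders, and the actual request is shown commented out since it needs a live cluster.

```python
def livy_batch_payload(app_file, args=None, conf=None):
    """Build the JSON body for POST /batches on a Livy server."""
    body = {'file': app_file}  # jar or .py on a path the cluster can read
    if args:
        body['args'] = args
    if conf:
        body['conf'] = conf
    return body

payload = livy_batch_payload('s3://my-bucket/jobs/etl.py', args=['2020-01-01'])

# Against a live cluster (e.g. from a PythonOperator):
# import requests
# resp = requests.post('http://<emr-master-ip>:8998/batches', json=payload)
# batch_id = resp.json()['id']
# # then poll GET /batches/<batch_id>/state until it reports success/dead
```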
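The EmrSteps route can be sketched as follows. The step definition below uses EMR's real AddSteps format (`command-runner.jar` running `spark-submit` is the standard way to submit Spark work as a step); the DAG wiring in the comments uses the actual `airflow.contrib` operator and sensor names, while the task ids, cluster id, and S3 path are placeholders.

```python
def spark_submit_step(name, app_file, extra_args=()):
    """EMR step definition that runs spark-submit via command-runner.jar."""
    return {
        'Name': name,
        'ActionOnFailure': 'CONTINUE',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            'Args': ['spark-submit', '--deploy-mode', 'cluster',
                     app_file, *extra_args],
        },
    }

# In the DAG, add the step and then watch it with a sensor, since steps
# run asynchronously:
# from airflow.contrib.operators.emr_add_steps_operator import EmrAddStepsOperator
# from airflow.contrib.sensors.emr_step_sensor import EmrStepSensor
#
# add_step = EmrAddStepsOperator(
#     task_id='add_step', job_flow_id='<cluster-id>', aws_conn_id='aws_default',
#     steps=[spark_submit_step('etl', 's3://my-bucket/jobs/etl.py')])
# watch_step = EmrStepSensor(
#     task_id='watch_step', job_flow_id='<cluster-id>', aws_conn_id='aws_default',
#     step_id="{{ task_instance.xcom_pull(task_ids='add_step')[0] }}")
# add_step >> watch_step
```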
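For the SSH route, the cumbersome part is assembling the spark-submit command string, which is easy to do programmatically. The builder below is a hypothetical helper (not part of Airflow); the `SSHOperator` usage in the comment uses the real `airflow.contrib` operator, with the connection id and S3 path as placeholders.

```python
def build_spark_submit(app_file, conf=None, app_args=()):
    """Assemble a spark-submit command line from its parts."""
    parts = ['spark-submit', '--master', 'yarn', '--deploy-mode', 'cluster']
    for key, value in (conf or {}).items():
        parts += ['--conf', '{}={}'.format(key, value)]
    parts.append(app_file)
    parts += list(app_args)
    return ' '.join(parts)

cmd = build_spark_submit('s3://my-bucket/jobs/etl.py',
                         conf={'spark.executor.memory': '4g'})

# from airflow.contrib.operators.ssh_operator import SSHOperator
# submit = SSHOperator(task_id='spark_submit',
#                      ssh_conn_id='emr_master_ssh',  # SSH connection to the master
#                      command=cmd)
```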