Problem description
TL;DR: In a Spark Standalone cluster, what are the differences between client and cluster deploy modes? How do I set which mode my application is going to run on?
We have a Spark Standalone cluster with three machines, all of them running Spark 1.6.1:
- A master machine, which is also where our application is run using spark-submit
- 2 identical worker machines
From the Spark documentation, I read:

However, I don't really understand the practical differences by reading this, and I don't get the advantages and disadvantages of the different deploy modes.
Additionally, when I start my application using spark-submit, even if I set the property spark.submit.deployMode to "cluster", the Spark UI for my context shows the following entry:
So I am not able to test both modes to see the practical differences. That being said, my questions are:
1) What are the practical differences between Spark Standalone client deploy mode and cluster deploy mode? What are the pros and cons of using each one?
2) How do I choose which one my application is going to run on, using spark-submit?
Recommended answer
Let's try to look at the differences between client and cluster mode.
Client:

- Driver runs on a dedicated server (the Master node) inside a dedicated process. This means it has all available resources at its disposal to execute work.
- Driver opens up a dedicated Netty HTTP server and distributes the specified JAR files to all Worker nodes (big advantage).
- Because the Master node has dedicated resources of its own, you don't need to "spend" worker resources for the Driver program.
- If the driver process dies, you need an external monitoring system to reset its execution.
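As a quick illustration, a client-mode submission launched from the master machine might look like the following sketch. The class name, master URL, and JAR path here are hypothetical placeholders, not from the original post:

```shell
# Hypothetical client-mode submission, run on the master machine.
# In client mode, the driver runs inside this spark-submit process itself,
# so it uses the master machine's resources rather than a worker's.
./bin/spark-submit \
  --class com.example.MyApp \
  --master spark://master-host:7077 \
  --deploy-mode client \
  /path/to/my-app.jar
```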
Cluster:

- Driver runs on one of the cluster's Worker nodes. The worker is chosen by the Master leader.
- Driver runs as a dedicated, standalone process inside the Worker.
- Driver program takes up at least 1 core and a dedicated amount of memory from one of the workers (this can be configured).
- Driver program can be monitored from the Master node using the --supervise flag and be reset in case it dies.
- When working in Cluster mode, all JARs related to the execution of your application need to be publicly available to all the workers. This means you can either manually place them in a shared location or in a folder on each of the workers.
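A cluster-mode submission with supervision might look like the sketch below (again, the names and paths are illustrative). Note that the JAR location must be reachable from every worker, for example a shared filesystem or HDFS:

```shell
# Hypothetical cluster-mode submission with driver supervision.
# The driver is launched on one of the workers, so the JAR must be
# visible to all of them (shared mount, HDFS, etc.).
./bin/spark-submit \
  --class com.example.MyApp \
  --master spark://master-host:7077 \
  --deploy-mode cluster \
  --supervise \
  hdfs:///apps/my-app.jar
```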
Which one is better? Not sure; that's actually for you to experiment with and decide. There is no single better choice here: you gain some things from the former and some from the latter, and it's up to you to see which one works better for your use case.
The way to choose which mode to run in is by using the --deploy-mode flag. From the Spark Configuration page:
./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]
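Since the question mentions spark.submit.deployMode: the deploy mode can also be set as a configuration property instead of the flag, either on the command line or persistently in conf/spark-defaults.conf. The values below are illustrative:

```shell
# Same effect as --deploy-mode cluster, expressed as a conf property:
./bin/spark-submit \
  --class com.example.MyApp \
  --master spark://master-host:7077 \
  --conf spark.submit.deployMode=cluster \
  /path/to/my-app.jar

# Or set it once for all submissions in conf/spark-defaults.conf:
# spark.submit.deployMode  cluster
```

If both are given, the explicit --deploy-mode flag takes precedence over the configuration property.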