This article looks at Spark: launching jobs with different memory/core configurations simultaneously from a single JVM. It presents the original question and a recommended answer for reference.

Problem description

Suppose you have a Spark cluster with the Standalone manager, where jobs are scheduled through a SparkSession created in a client application, and the client application runs on a JVM. For the sake of performance, you have to launch each job with a different configuration; see the job types example below.

The question is: how would you launch multiple Spark jobs with different session configs simultaneously?

By different session configs I mean the following (a sketch of where these settings normally live follows the list):


  • spark.executor.cores

  • spark.executor.memory

  • spark.kryoserializer.buffer.max

  • spark.scheduler.pool

  • etc

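For context, here is a minimal sketch of where such settings are normally supplied, namely when the SparkSession is built; the master URL and values are purely illustrative, and in a plain setup they are fixed for the lifetime of the application:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative values only: with a plain SparkSession these settings are
// fixed once per application, at the moment the session is created.
val spark = SparkSession.builder()
  .master("spark://master-host:7077")               // hypothetical Standalone master URL
  .appName("example-job")
  .config("spark.executor.cores", "2")
  .config("spark.executor.memory", "4g")
  .config("spark.kryoserializer.buffer.max", "512m")
  // spark.scheduler.pool is usually set per thread at runtime; see the
  // FAIR-scheduler sketch further below.
  .getOrCreate()
```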

Possible ways to solve the problem:


  1. Set different session configs for each Spark job within the same SparkSession. Is that even possible? (See the sketch after this list.)
  2. Launch another JVM just to start another SparkSession, something I could call a Spark session service. But you never know how many jobs with different configs you will need to launch simultaneously in the future. At the moment I only need 2-3 different configs at a time. That may be enough, but it is not flexible.
  3. Create a global session with the same configs for all kinds of jobs. But from a performance point of view this approach is rock bottom.
  4. Use Spark only for heavy jobs, and run all quick search tasks outside Spark. But that is a mess, since you need to keep another solution (such as Hazelcast) running in parallel with Spark and split resources between them. Moreover, it adds extra complexity for everyone: deployment, support, and so on.
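Regarding option 1: to my knowledge, executor cores and memory cannot be changed per job inside a single running SparkSession; the closest built-in mechanism is switching the scheduler pool per thread with the FAIR scheduler, which only affects task scheduling, not executor sizing. A minimal sketch, assuming a pool named "fast" is defined in a fairscheduler.xml file (both names are hypothetical):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("shared-session")
  .config("spark.scheduler.mode", "FAIR")
  // Assumes an allocation file that defines the pools, e.g. conf/fairscheduler.xml.
  .config("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")
  .getOrCreate()

// Route the jobs submitted from this thread to the "fast" pool.
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "fast")
val quickResult = spark.range(0, 1000).count()

// Passing null removes the local property, falling back to the default pool.
spark.sparkContext.setLocalProperty("spark.scheduler.pool", null)
```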






Job types example


  1. Dump-huge-database task. It is a CPU-light but IO-intensive, long-running task, so you may want to launch as many executors as you can, with low memory and few cores per executor.
  2. Heavy handle-dump-results task. It is CPU intensive, so you would launch one executor per cluster machine, with maximum CPU and cores.
  3. Quick data-retrieval task, which requires one executor per machine and minimal resources.
  4. Something in between 1-2 and 3, where a job should take half of the cluster resources.
  5. etc.
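To make the contrast concrete, here is a rough sketch of how job types 1-3 might each be described by their own SparkConf if they were submitted as separate applications; all numbers are invented for illustration:

```scala
import org.apache.spark.SparkConf

// Job type 1: IO-bound dump -- many thin executors (illustrative values).
val dumpConf = new SparkConf()
  .setAppName("dump-huge-database")
  .set("spark.executor.cores", "1")
  .set("spark.executor.memory", "1g")

// Job type 2: CPU-bound processing of the dump -- one fat executor per machine.
val heavyConf = new SparkConf()
  .setAppName("handle-dump-results")
  .set("spark.executor.cores", "16")
  .set("spark.executor.memory", "48g")

// Job type 3: quick lookups -- one executor per machine, minimal footprint.
val quickConf = new SparkConf()
  .setAppName("quick-retrieve")
  .set("spark.executor.cores", "1")
  .set("spark.executor.memory", "512m")
```

Each of these would back a separate Spark application, and in the setup described above that means a separate JVM, which is exactly the limitation the question is about.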


Recommended answer

Spark Standalone uses a simple FIFO scheduler for applications. By default, each application uses all the available nodes in the cluster. The number of nodes can be limited per application, per user, or globally. Other resources, such as memory and CPUs, can be controlled via the application's SparkConf object.
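For instance, a single application's footprint on a Standalone cluster can be capped through its SparkConf before the session is created; the master URL and numbers below are illustrative:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Caps this application's share of a Standalone cluster (illustrative values).
val conf = new SparkConf()
  .setMaster("spark://master-host:7077")   // hypothetical master URL
  .setAppName("capped-app")
  .set("spark.cores.max", "8")             // total cores this application may claim
  .set("spark.executor.memory", "4g")      // memory per executor

val spark = SparkSession.builder().config(conf).getOrCreate()
```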

Apache Mesos has master and slave processes. The master makes offers of resources to the application (called a framework in Apache Mesos), which either accepts or rejects them. Thus, claiming available resources and running jobs is determined by the application itself. Apache Mesos allows fine-grained control of the resources in a system, such as CPUs, memory, disks, and ports. Apache Mesos also offers coarse-grained control of resources, where Spark allocates a fixed number of CPUs to each executor in advance, and these CPUs are not released until the application exits. Note that in the same cluster some applications can be set to use fine-grained control while others use coarse-grained control.
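On Mesos the mode is chosen per application through spark.mesos.coarse; here is a sketch with a hypothetical Mesos master address (fine-grained mode has been deprecated in newer Spark releases):

```scala
import org.apache.spark.SparkConf

// Coarse-grained mode: executors keep their CPUs until the application exits.
val coarseConf = new SparkConf()
  .setMaster("mesos://mesos-master:5050")  // hypothetical Mesos master URL
  .setAppName("coarse-grained-app")
  .set("spark.mesos.coarse", "true")

// Fine-grained mode: CPUs are claimed and released per task.
val fineConf = new SparkConf()
  .setMaster("mesos://mesos-master:5050")
  .setAppName("fine-grained-app")
  .set("spark.mesos.coarse", "false")
```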

Apache Hadoop YARN has a ResourceManager with two parts, a Scheduler and an ApplicationsManager. The Scheduler is a pluggable component. Two implementations are provided: the CapacityScheduler, useful in a cluster shared by more than one organization, and the FairScheduler, which ensures all applications, on average, get an equal share of resources. Both schedulers assign applications to queues, and each queue gets resources that are shared equally between the queues. Within a queue, resources are shared between the applications. The ApplicationsManager is responsible for accepting job submissions and starting the application-specific ApplicationMaster. In this case, the ApplicationMaster is the Spark application. In the Spark application, resources are specified in the application's SparkConf object.
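On YARN the target queue, along with the usual executor settings, is again given per application in its SparkConf; the queue name below is hypothetical:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Submits to a specific YARN queue; "analytics" is a hypothetical queue name.
val conf = new SparkConf()
  .setMaster("yarn")
  .setAppName("queued-app")
  .set("spark.yarn.queue", "analytics")
  .set("spark.executor.memory", "4g")
  .set("spark.executor.cores", "2")

val spark = SparkSession.builder().config(conf).getOrCreate()
```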

For your case, with Standalone alone this is simply not possible; there may be some ready-made solutions out there, but I have not come across any.

This concludes the article on Spark: launching jobs with different memory/core configurations simultaneously from a single JVM. We hope the recommended answer is helpful.
