This article covers the Spark on YARN resource manager, specifically the relationship between YARN Containers and Spark Executors, and should be a useful reference for anyone facing the same problem.

Problem description

I'm new to Spark on YARN and don't understand the relation between the YARN Containers and the Spark Executors. I tried out the following configuration, based on the results of the yarn-utils.py script, which can be used to find an optimal cluster configuration.

The Hadoop cluster (HDP 2.4) I'm working on:

  • 1 master node:
    • CPU: 2 CPUs with 6 cores each = 12 cores
    • RAM: 64 GB
    • SSD: 2 x 512 GB
  • Worker nodes (each):
    • CPU: 2 CPUs with 6 cores each = 12 cores
    • RAM: 64 GB
    • HDD: 4 x 3 TB = 12 TB

So I ran python yarn-utils.py -c 12 -m 64 -d 4 -k True (c=cores, m=memory, d=hdds, k=hbase-installed) and got the following result:

     Using cores=12 memory=64GB disks=4 hbase=True
     Profile: cores=12 memory=49152MB reserved=16GB usableMem=48GB disks=4
     Num Container=8
     Container Ram=6144MB
     Used Ram=48GB
     Unused Ram=16GB
     yarn.scheduler.minimum-allocation-mb=6144
     yarn.scheduler.maximum-allocation-mb=49152
     yarn.nodemanager.resource.memory-mb=49152
     mapreduce.map.memory.mb=6144
     mapreduce.map.java.opts=-Xmx4915m
     mapreduce.reduce.memory.mb=6144
     mapreduce.reduce.java.opts=-Xmx4915m
     yarn.app.mapreduce.am.resource.mb=6144
     yarn.app.mapreduce.am.command-opts=-Xmx4915m
     mapreduce.task.io.sort.mb=2457
    

I applied these settings via the Ambari interface and restarted the cluster. The values also roughly match what I had calculated manually before.

My problem now is:

  • to find the optimal settings for my spark-submit script
    • the parameters --num-executors, --executor-cores & --executor-memory (a sketch of the call follows below).
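For concreteness, this is the shape of the call in question; a minimal sketch with placeholder values, not a recommendation, and with a hypothetical application jar name (my-app.jar):

    # shape of the spark-submit call in question (placeholder values, hypothetical my-app.jar)
    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --num-executors 8 \
      --executor-cores 2 \
      --executor-memory 4g \
      my-app.jar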

However, I found this post, What is a container in YARN?, but it didn't really help, as it doesn't describe the relation to the executors.

Can someone help to solve one or more of the questions?

Recommended answer

I will report my insights here step by step:

When running Spark on YARN, each Spark executor runs as a YARN container. [...]

  • This means the number of containers will always be the same as the number of executors created by a Spark application, e.g. via the --num-executors parameter in spark-submit.
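A quick illustration of that relation (hypothetical numbers and jar name; note that on top of the executor containers, YARN also starts one extra container for the application master):

    # request 4 executors for a hypothetical my-app.jar ...
    spark-submit --master yarn --deploy-mode cluster --num-executors 4 my-app.jar
    # ... the ResourceManager UI then shows 4 executor containers
    # (plus 1 application-master container) for this application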

Set by yarn.scheduler.minimum-allocation-mb, every container always allocates at least this amount of memory. This means that if the parameter --executor-memory is set to e.g. only 1g but yarn.scheduler.minimum-allocation-mb is e.g. 6g, the container ends up much bigger than the Spark application needs.
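A worked example of that effect with this cluster's values. Keep in mind that Spark requests the container as --executor-memory plus a memory overhead, by default roughly max(384 MB, 10% of the executor memory):

    --executor-memory 1g                  ->   1024 MB heap
    + default overhead (~384 MB)          ->   1408 MB requested
    yarn.scheduler.minimum-allocation-mb  =    6144 MB
    => YARN grants a 6144 MB container; ~4.6 GB of it is never used by the executor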

The other way round: if the parameter --executor-memory is set to something higher than the yarn.scheduler.minimum-allocation-mb value, e.g. 12g, the container will allocate more memory dynamically, but only if the requested amount of memory is smaller than or equal to the yarn.scheduler.maximum-allocation-mb value.
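And the other direction, again with this cluster's values (approximate numbers):

    --executor-memory 12g                 ->  12288 MB heap
    + default overhead (~10%)             -> ~13517 MB requested
    yarn.scheduler.maximum-allocation-mb  =   49152 MB
    => the request is below the maximum, so YARN can grant it (depending on the
       scheduler, the grant may be rounded up to a multiple of the minimum allocation)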

The value of yarn.nodemanager.resource.memory-mb determines how much memory can be allocated in total by all containers on one host!
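With this cluster's numbers, that is exactly where the container count reported by yarn-utils.py comes from:

    yarn.nodemanager.resource.memory-mb  =  49152 MB per worker node
    container size                       =   6144 MB
    49152 MB / 6144 MB = 8 containers per node   (matches "Num Container=8" above)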

=> So setting yarn.scheduler.minimum-allocation-mb allows you to run smaller containers, e.g. for smaller executors (otherwise it would be a waste of memory).

=> Setting yarn.scheduler.maximum-allocation-mb to the maximum value (e.g. equal to yarn.nodemanager.resource.memory-mb) allows you to define bigger executors (more memory is allocated if needed, e.g. via the --executor-memory parameter).
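Putting the two limits together, a hedged illustration with this cluster's values (the memory figures are approximate and my-app.jar is again just a placeholder):

    # fits: 42g heap + ~10% overhead ~= 47.3 GB <= 49152 MB maximum-allocation-mb
    spark-submit --master yarn --deploy-mode cluster --executor-memory 42g my-app.jar

    # does not fit: 64g (+ overhead) exceeds yarn.scheduler.maximum-allocation-mb;
    # Spark aborts the submission with an error along the lines of
    # "Required executor memory is above the max threshold of this cluster"
    spark-submit --master yarn --deploy-mode cluster --executor-memory 64g my-app.jar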

That's all for this article on the Spark on YARN resource manager and the relation between YARN Containers and Spark Executors; hopefully the answer above is helpful.
