Ever-increasing physical memory for a Spark application in YARN

Problem description

I am running a Spark application on YARN with two executors, each with Xms/Xmx set to 32 GB and spark.yarn.executor.memoryOverhead set to 6 GB.

I am seeing the application's physical memory usage grow continuously until the container is finally killed by the node manager:

2015-07-25 15:07:05,354 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Container [pid=10508,containerID=container_1437828324746_0002_01_000003] is running beyond physical memory limits. Current usage: 38.0 GB of 38 GB physical memory used; 39.5 GB of 152 GB virtual memory used. Killing container.
Dump of the process-tree for container_1437828324746_0002_01_000003 :
    |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
    |- 10508 9563 10508 10508 (bash) 0 0 9433088 314 /bin/bash -c /usr/java/default/bin/java -server -XX:OnOutOfMemoryError='kill %p' -Xms32768m -Xmx32768m  -Dlog4j.configuration=log4j-executor.properties -XX:MetaspaceSize=512m -XX:+UseG1GC -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintGCDetails -Xloggc:gc.log -XX:AdaptiveSizePolicyOutputInterval=1  -XX:+UseGCLogFileRotation -XX:GCLogFileSize=500M -XX:NumberOfGCLogFiles=1 -XX:MaxDirectMemorySize=3500M -XX:NewRatio=3 -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=36082 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -XX:NativeMemoryTracking=detail -XX:ReservedCodeCacheSize=100M -XX:MaxMetaspaceSize=512m -XX:CompressedClassSpaceSize=256m -Djava.io.tmpdir=/data/yarn/datanode/nm-local-dir/usercache/admin/appcache/application_1437828324746_0002/container_1437828324746_0002_01_000003/tmp '-Dspark.driver.port=43354' -Dspark.yarn.app.container.log.dir=/opt/hadoop/logs/userlogs/application_1437828324746_0002/container_1437828324746_0002_01_000003 org.apache.spark.executor.CoarseGrainedExecutorBackend akka.tcp://sparkDriver@nn1:43354/user/CoarseGrainedScheduler 1 dn3 6 application_1437828324746_0002 1> /opt/hadoop/logs/userlogs/application_1437828324746_0002/container_1437828324746_0002_01_000003/stdout 2> /opt/hadoop/logs/userlogs/application_1437828324746_0002/container_1437828324746_0002_01_000003/stderr
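
The limits in this log line can be reproduced from the settings in the question. The following is a minimal sketch, not YARN's actual code: the function name is hypothetical, the virtual-to-physical ratio of 4 is inferred from the 152 GB / 38 GB figures in the log rather than stated in the question, and any rounding YARN applies to allocation increments is ignored.

```python
def container_limits_mb(heap_mb, overhead_mb, vmem_pmem_ratio):
    """Return (physical, virtual) container limits in MB.

    Hypothetical helper: physical limit = executor heap + memory overhead,
    virtual limit = physical limit * configured vmem/pmem ratio.
    """
    physical = heap_mb + overhead_mb
    virtual = physical * vmem_pmem_ratio
    return physical, virtual


# Values from the question: 32 GB heap, 6 GB overhead.
# The ratio of 4 is an assumption inferred from the log (152 / 38).
physical_mb, virtual_mb = container_limits_mb(32 * 1024, 6 * 1024, 4)
print(physical_mb / 1024, virtual_mb / 1024)  # 38.0 152.0
```

Under these assumptions the container is killed as soon as the process tree's resident memory reaches heap plus overhead, which is why growth outside the heap matters here.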

I disabled YARN's "yarn.nodemanager.pmem-check-enabled" parameter and noticed that physical memory usage went up to 40 GB.

I checked the total RSS in /proc/pid/smaps, and it was the same value as the physical memory reported by YARN and shown by the top command.
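
One way to total the RSS from an smaps dump is to sum its "Rss:" lines. This is a rough sketch under stated assumptions: the function name is hypothetical, and the sample text below is made-up data in the smaps layout, not output from the application in question.

```python
def total_rss_kb(smaps_text):
    """Sum every 'Rss:' line of a /proc/<pid>/smaps dump, in kB."""
    total = 0
    for line in smaps_text.splitlines():
        if line.startswith("Rss:"):
            # Lines look like "Rss:                 312 kB"
            total += int(line.split()[1])
    return total


# Made-up sample with two mappings, for illustration only.
sample = """\
00400000-00401000 r-xp 00000000 08:01 123 /usr/java/default/bin/java
Size:                  4 kB
Rss:                   4 kB
7f0000000000-7f0000200000 rw-p 00000000 00:00 0
Size:               2048 kB
Rss:                1024 kB
"""
print(total_rss_kb(sample))  # 1028
```

On a live Linux system the same function could be fed the contents of the real file, for example `open(f"/proc/{pid}/smaps").read()`.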

I confirmed that the heap is not the problem; something in off-heap/native memory is growing. I used tools like VisualVM but didn't find anything increasing there. Direct memory usage also didn't exceed 600 MB. The peak number of live threads was 70-80, and total thread stack size didn't exceed 100 MB. Metaspace usage was around 60-70 MB.
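
Adding up the figures above shows how much memory is left unexplained. This is only back-of-the-envelope arithmetic using the upper values from the question; it assumes the full 32 GB heap is resident and leaves out smaller JVM areas such as the code cache, so the result is an approximate lower bound on the gap.

```python
GB = 1024  # MB per GB

# Upper figures reported in the question, in MB.
known_mb = {
    "heap (Xmx)": 32 * GB,
    "direct buffers (observed peak)": 600,
    "thread stacks (observed peak)": 100,
    "metaspace (observed)": 70,
}


def unexplained_mb(rss_mb, accounted):
    """Resident memory not covered by the accounted-for areas, in MB."""
    return rss_mb - sum(accounted.values())


# 40 GB is the RSS observed once the pmem check was disabled.
gap = unexplained_mb(40 * GB, known_mb)
print(gap)  # 7422
```

Roughly 7 GB of resident memory is therefore not visible through the heap, direct buffer, thread, or metaspace figures, which points at native allocations made outside those areas.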

FYI, I am on Spark 1.2 and Hadoop 2.4.0. My application is based on Spark SQL, is HDFS read/write intensive, and caches data using Spark SQL's in-memory caching.

Where should I look to debug this memory leak, or is there already a tool for it?

Recommended answer

I was finally able to get rid of the issue. The compressors created in Spark SQL's Parquet write path weren't being recycled, so my executors were creating a brand-new compressor (backed by native memory) for every Parquet file written, eventually exhausting the physical memory limit.
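
The difference between recycling and not recycling can be illustrated with a toy pool. This is not Parquet's or Hadoop's code: the `Compressor` and `CompressorPool` classes and the `write_files` function are hypothetical stand-ins that only count allocations, whereas in the real case each allocation also holds native memory.

```python
class Compressor:
    """Stand-in for a codec object that holds native (off-heap) buffers."""
    allocated = 0  # how many instances were ever created

    def __init__(self):
        Compressor.allocated += 1


class CompressorPool:
    """Hands out idle compressors and only allocates when none are idle."""

    def __init__(self):
        self._idle = []

    def borrow(self):
        return self._idle.pop() if self._idle else Compressor()

    def give_back(self, compressor):
        self._idle.append(compressor)


def write_files(pool, n_files, recycle):
    """Simulate n_files sequential writes; return compressors allocated."""
    before = Compressor.allocated
    for _ in range(n_files):
        compressor = pool.borrow()
        # ... the file would be compressed and written here ...
        if recycle:
            pool.give_back(compressor)
    return Compressor.allocated - before


print(write_files(CompressorPool(), 1000, recycle=False))  # 1000
print(write_files(CompressorPool(), 1000, recycle=True))   # 1
```

Without the `give_back` call the pool never has an idle instance, so every file costs a fresh allocation; with it, one compressor serves all sequential writes.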

I opened the following bug in the Parquet Jira and raised a PR for it:

https://issues.apache.org/jira/browse/PARQUET-353

This fixed the memory issue on my end.

P.S. You will see this problem only in a Parquet write-intensive application.


08-05 08:47