This article looks at how to estimate the memory required by a Scala Spark job.

Problem Description

I'm attempting to discover how much memory will be required by a Spark job.

When I run the job I receive an exception:

15/02/12 12:01:08 INFO rdd.HadoopRDD: Input split: file:/c:/data/example.txt:20661+20661
15/02/12 12:01:08 INFO rdd.HadoopRDD: Input split: file:/c:/data/example.txt:61983+20661
15/02/12 12:01:09 INFO rdd.HadoopRDD: Input split: file:/c:/data/example.txt:0+20661
15/02/12 12:01:09 INFO rdd.HadoopRDD: Input split: file:/c:/data/example.txt:61983+20661
15/02/12 12:01:09 INFO rdd.HadoopRDD: Input split: file:/c:/data/example.txt:41322+20661
15/02/12 12:01:09 INFO rdd.HadoopRDD: Input split: file:/c:/data/example.txt:20661+20661
15/02/12 12:01:11 ERROR executor.Executor: Exception in task 2.0 in stage 0.0 (TID 2)
java.lang.OutOfMemoryError: Java heap space

Many more messages like "15/02/12 12:01:08 INFO rdd.HadoopRDD: Input split: file:/c:/data/example.txt:20661+20661" are printed; they are truncated here for brevity.

I'm logging the computations, and after approximately 1,000,000 calculations I receive the above exception.

The number of calculations required to finish the job is 64,000,000.

Currently I'm using 2GB of memory, so does this mean that running this job in memory without any further code changes will require 2GB * 64 = 128GB, or is this a much too simplistic way of anticipating the required memory?

How is each split, such as "15/02/12 12:01:09 INFO rdd.HadoopRDD: Input split: file:/c:/data/example.txt:20661+20661", generated? These are not added to the file system, since "file:/c:/data/example.txt:20661+20661" does not exist on the local machine.
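As a side note, a minimal sketch along the following lines (local mode, standard SparkContext.textFile API, and the file path taken from the log above; the partition count of 4 is hypothetical) illustrates that each "offset+length" split is a logical byte range of the same file, determined by the requested partition count, rather than a separate file written to disk.

import org.apache.spark.{SparkConf, SparkContext}

object SplitInspection extends App {

  val sc = new SparkContext(new SparkConf().setAppName("SplitInspection").setMaster("local[*]"))

  // Hypothetical partition count; each resulting "Input split: ...:offset+length"
  // log line refers to a byte range within this one file, not to a new file.
  val lines = sc.textFile("file:/c:/data/example.txt", 4)

  println(lines.partitions.length) // number of logical splits / partitions

  sc.stop()
}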

Recommended Answer

To estimate the amount of required memory I've used this method:

Use http://code.google.com/p/memory-measurer/, as described in: Calculate size of Object in Java.

Once it is set up, you can use the code below to estimate the size of a Scala collection, which in turn gives an indication of the memory required by the Spark application:

import objectexplorer.{MemoryMeasurer, ObjectGraphMeasurer}  // classes from the memory-measurer library

object ObjectSizeDriver extends App {

  val toMeasure = List(1, 2, 3, 4, 5, 6)

  // Breakdown of the object graph (objects, references, primitives).
  println(ObjectGraphMeasurer.measure(toMeasure))
  // Total footprint of the collection in bytes.
  println(MemoryMeasurer.measureBytes(toMeasure))

}
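As a rough extension of that idea (the sample records, sample size, and extrapolation below are hypothetical, not part of the original answer), you could measure a representative sample of the records the job materializes and scale linearly to the full 64,000,000 computations:

import objectexplorer.MemoryMeasurer

object MemoryEstimateDriver extends App {

  // Hypothetical stand-ins for the job's real records.
  val sampleSize = 1000
  val sample     = (1 to sampleSize).map(i => (i.toLong, i.toString)).toList

  val sampleBytes = MemoryMeasurer.measureBytes(sample) // footprint of the sample in bytes
  val totalCount  = 64000000L                           // computations needed to finish the job

  // Linear extrapolation: bytes per record times total record count.
  val estimatedGB = sampleBytes.toDouble / sampleSize * totalCount / math.pow(1024, 3)
  println(f"Estimated in-memory footprint: $estimatedGB%.2f GB")
}

This is only a back-of-the-envelope linear estimate; the actual Spark memory requirement also depends on how much of the data must be held in memory at the same time.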
