Memory footprint of RDDs in Spark

Question

I'm not sure about the concept of memory footprint. When loading a parquet file of, e.g., 1 GB and creating RDDs out of it in Spark, what would be the memory footprint for each RDD?

Answer

When you create an RDD out of a parquet file, nothing is loaded or executed until you run an action (e.g., first, collect) on the RDD.
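This lazy-evaluation behavior can be illustrated with an analogy in plain Python (not Spark itself): a generator pipeline, like a chain of RDD transformations, does no work until a terminal "action" consumes it.

```python
# Analogy in plain Python: like RDD transformations, generator expressions
# are lazy -- nothing is computed until a terminal ("action") step runs.
log = []

def load(n):
    """Stand-in for reading records from a file."""
    for i in range(n):
        log.append(i)          # record that this element was actually produced
        yield i

pipeline = (x * 2 for x in load(5))   # "transformation": nothing runs yet
assert log == []                       # no records have been loaded so far

result = list(pipeline)                # "action": forces the whole pipeline
assert result == [0, 2, 4, 6, 8]
assert log == [0, 1, 2, 3, 4]          # only now were the records produced
```

Spark's transformations (`map`, `filter`, ...) build a plan in the same way; actions (`count`, `collect`, `first`, ...) trigger the actual loading and computation.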

Now, your memory footprint will most likely vary over time. Say you have 100 partitions and they are equally sized (10 MB each). If you are running on a cluster with 20 cores, then at any point in time you only need to hold 10 MB x 20 = 200 MB of data in memory.
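The estimate above can be sketched as a back-of-the-envelope calculation (the partition count, partition size, and core count are the illustrative numbers from the answer, not Spark defaults):

```python
# Back-of-the-envelope footprint: roughly one partition per core needs to be
# resident at a time, so concurrent memory use is bounded by the core count,
# not by the total file size.
file_size_mb = 1024          # ~1 GB parquet file (illustrative)
num_partitions = 100
partition_size_mb = 10       # assumes equally sized partitions
num_cores = 20

concurrent_mb = partition_size_mb * min(num_cores, num_partitions)
assert concurrent_mb == 200  # matches the 10 MB x 20 = 200 MB estimate
```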

On top of this, given that Java objects tend to take more space, it is not easy to say exactly how much space your 1 GB file will take on the JVM heap (assuming you load the entire file). It could be 2x, or it could be more.
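This object-boxing overhead is easy to demonstrate in any managed runtime. The sketch below uses CPython object sizes rather than JVM ones, so the exact numbers differ, but the effect is the same kind: boxed values take several times the space of the raw bytes.

```python
# Illustrating object overhead (in CPython here; the JVM shows the same
# effect): boxing values as objects inflates memory well beyond the raw bytes.
import sys

n = 1_000_000
raw_bytes = n * 8                     # one million 64-bit values, tightly packed

data = list(range(n))                 # the same values boxed as Python objects
object_bytes = sys.getsizeof(data) + sum(sys.getsizeof(x) for x in data)

# The boxed representation is more than 2x the packed one.
assert object_bytes > 2 * raw_bytes
print(f"packed: {raw_bytes} B, boxed: {object_bytes} B")
```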

One trick you can use to test this is to force your RDD to be cached. You can then check the Spark UI under Storage and see how much space the RDD took to cache.
