This article explains how Spark processes data that is larger than the cluster's memory. It should be a useful reference for anyone facing the same problem.

Problem Description

If I have only one executor with 25 GB of memory, and it can run only one task at a time, is it possible to process (transform and act on) 1 TB of data? If so, how will the data be read, and where will intermediate data be stored?

Also, for the same scenario: if the Hadoop file has 300 input splits, there will be 300 partitions in the RDD. In that case, where will those partitions be? Will they remain on the Hadoop disks only, and will my single task run 300 times?
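For reference, a minimal sketch of the scenario being asked about, showing how the partition count can be inspected; the master setting, application name, and HDFS path are placeholders, not taken from the question:

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical single-core local setup: one core means one task runs at a time.
val sc = new SparkContext(
  new SparkConf().setAppName("partition-count").setMaster("local[1]"))

// textFile creates roughly one partition per HDFS input split (placeholder path).
val rdd = sc.textFile("hdfs:///data/one_terabyte_file")
println(rdd.getNumPartitions)   // ~300 partitions; the single core works through them one by one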

Recommended Answer

I found a good answer on the Hortonworks website.

a) Simple read, no shuffle (no joins, ...)

For the initial read, Spark, like MapReduce, reads the data as a stream and processes it as it comes along. That is, unless there is a reason to, Spark will NOT materialize the full RDD in memory (you can tell it to do so if you want to cache a small dataset). An RDD is resilient because Spark knows how to recreate it (for example, by re-reading a block from HDFS), not because it is stored in memory in different locations (although that can be done too).

So if you filter out most of your data, or do an efficient aggregation that aggregates on the map side, you will never have the full table in memory.
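To illustrate point (a), here is a sketch of a read-filter-aggregate pipeline that streams through the data without materializing the full RDD; the application name, HDFS path, and record format are made up:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("read-no-materialize"))

// Lazy: nothing is read until an action runs. Each HDFS split becomes one partition,
// and records stream through the filter, so the full 1 TB never sits in memory.
val lines = sc.textFile("hdfs:///data/big_table")

val errorCount = lines
  .filter(_.contains("ERROR"))               // drops most records as they stream by
  .map(line => (line.split(",")(0), 1L))     // (key, 1) pairs, assuming comma-separated records
  .reduceByKey(_ + _)                        // combines on the map side before any shuffle

errorCount.take(10).foreach(println)         // the action triggers the streamed read
sc.stop()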

b) Shuffle

This is done very similarly to MapReduce: it writes the map outputs to disk and the reducers read them over HTTP. However, Spark uses an aggressive filesystem buffering strategy on the Linux filesystem, so if the OS has memory available the data will not actually be written to physical disk.
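A small sketch of a job that shuffles; spark.local.dir is a real Spark property that controls where the map-output files land, but the chosen directory and the HDFS paths here are illustrative only:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("shuffle-sketch")
  .set("spark.local.dir", "/tmp/spark-shuffle")   // map outputs are written here; the OS page
                                                  // cache keeps them in RAM when memory allows
val sc = new SparkContext(conf)

val orders = sc.textFile("hdfs:///data/orders").map(l => (l.split(",")(0), l))
val users  = sc.textFile("hdfs:///data/users").map(l => (l.split(",")(0), l))

// The join repartitions both sides by key: each map task writes its output to local
// disk, and the reduce tasks fetch those blocks from the other executors.
val joined = orders.join(users)
joined.count()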

c) After the shuffle

RDDs after a shuffle are normally cached by the engine (otherwise a failed node or RDD would require a complete re-run of the job); however, as abdelkrim mentions, Spark can spill these to disk unless you overrule that.
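Continuing the previous sketch, one way to overrule the default behaviour is to choose the storage level yourself; MEMORY_AND_DISK allows spilling, while MEMORY_ONLY forbids it:

import org.apache.spark.storage.StorageLevel

// `joined` is the pair RDD from the shuffle sketch above.
val grouped = joined
  .groupByKey()
  .persist(StorageLevel.MEMORY_AND_DISK)   // keep what fits in memory, spill the rest to disk

grouped.count()                            // materializes and caches the shuffled result
// Use StorageLevel.MEMORY_ONLY instead if spilling to disk should not happen.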

d) Spark Streaming

This is a bit different. Spark Streaming expects all data to fit in memory unless you override the settings.
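A minimal Spark Streaming sketch of the kind of setting that can be overridden; the host and port are placeholders. The receiver's storage level controls whether received blocks may spill to disk:

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(new SparkConf().setAppName("stream-sketch"), Seconds(10))

// The default for a socket receiver is MEMORY_AND_DISK_SER_2; passing a level that
// includes disk keeps a slow batch from holding everything in memory.
val lines = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_AND_DISK_SER)

lines.count().print()
ssc.start()
ssc.awaitTermination()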

Here is the original page.

The original Spark design dissertation by Matei Zaharia also helps (section 2.6.4, "Behavior with Insufficient Memory").

Hopefully something in there is useful.

That concludes this article on how Spark processes data larger than the cluster's memory. Hopefully the recommended answer is helpful.
