Problem Description
As far as I know, Spark tries to do all computation in memory, unless you call persist with a disk storage option. If, however, we don't use persist at all, what does Spark do when an RDD doesn't fit in memory? What if we have very large data? How will Spark handle it without crashing?
From the Apache Spark FAQ:
Spark's operators spill data to disk if it does not fit in memory, allowing it to run well on any sized data. Likewise, cached datasets that do not fit in memory are either spilled to disk or recomputed on the fly when needed, as determined by the RDD's storage level.
Refer to the link below to learn more about storage levels and how to choose the appropriate one: programming-guide.html
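As a concrete illustration, here is a minimal Scala sketch (the object name and dataset are placeholders, and the local[*] master is just for demonstration) contrasting the default MEMORY_ONLY behavior of cache() with an explicit MEMORY_AND_DISK persist:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.storage.StorageLevel

object StorageLevelDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[*]", "StorageLevelDemo")

    // A placeholder RDD; in practice this could be much larger than
    // the memory available to the executors.
    val rdd = sc.parallelize(1 to 1000000)

    // cache() uses the default MEMORY_ONLY level: partitions that do
    // not fit in memory are NOT spilled; they are recomputed from the
    // RDD's lineage whenever they are needed again.
    rdd.cache()

    // MEMORY_AND_DISK instead spills partitions that do not fit in
    // memory to disk, so they are read back rather than recomputed.
    val persisted = rdd.map(_ * 2).persist(StorageLevel.MEMORY_AND_DISK)

    println(persisted.count())
    sc.stop()
  }
}
```

Note the trade-off: MEMORY_AND_DISK avoids recomputation at the cost of disk I/O, so for RDDs that are cheap to recompute, the default MEMORY_ONLY can actually be faster.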