Problem description
I understand that in Spark there are two types of operations:
- Transformations
- Actions
Transformations like map() and filter() are evaluated lazily, so that optimization can be done when an action is executed. For example, if I execute the action first(), Spark will optimize the job to read only the first line.
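The lazy behaviour described above can be observed with a small sketch. It assumes Spark is on the classpath and uses a local master; the object name LazyTransformations and the accumulator name are hypothetical, chosen only for illustration. An accumulator counts how often the function passed to map() actually runs.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical demo object; assumes a local Spark installation on the classpath.
object LazyTransformations {
  // Returns (map calls before any action, result of first(), map calls after first()).
  def run(): (Long, Int, Long) = {
    val spark = SparkSession.builder()
      .master("local[2]")              // assumption: run locally with two threads
      .appName("lazy-transformations")
      .getOrCreate()
    val sc = spark.sparkContext
    try {
      val calls = sc.longAccumulator("map calls")
      // Declaring the transformation does not run it.
      val mapped = sc.parallelize(1 to 1000, 4).map { x => calls.add(1); x + 1 }
      val beforeAction = calls.sum     // nothing has been computed yet
      val head = mapped.first()        // action: triggers only as much work as needed
      val afterFirst = calls.sum       // far fewer calls than the 1000 elements
      (beforeAction, head, afterFirst)
    } finally {
      spark.stop()
    }
  }
}
```

Calling LazyTransformations.run() should show zero map calls before first() and only a handful afterwards, since first() does not need to scan every partition.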
But why is the persist() operation evaluated lazily? Either way I go, eagerly or lazily, it is going to persist the entire RDD as per the storage level.
Can you please explain in detail why persist() is a transformation instead of an action?
Recommended answer
For starters, eager persistence would pollute the whole pipeline. cache or persist only expresses an intention. It doesn't mean we'll ever get to the point where the RDD is materialized and can actually be cached. Moreover, there are contexts where data is cached automatically.
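A minimal sketch of that point, assuming a local Spark setup (the object name PersistIsLazy and the accumulator are hypothetical): calling persist() only marks the RDD, the first action materializes and caches it, and a second action can then be served from the cache instead of recomputing.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Hypothetical demo object; assumes a local Spark installation on the classpath.
object PersistIsLazy {
  // Returns the number of map calls after persist(), after the first action,
  // and after the second action.
  def run(): (Long, Long, Long) = {
    val spark = SparkSession.builder()
      .master("local[2]")              // assumption: run locally with two threads
      .appName("persist-is-lazy")
      .getOrCreate()
    val sc = spark.sparkContext
    try {
      val calls = sc.longAccumulator("map calls")
      val rdd = sc.parallelize(1 to 100, 4).map { x => calls.add(1); x * 2 }

      rdd.persist(StorageLevel.MEMORY_ONLY) // records the intention only
      val afterPersist = calls.sum          // no work has been done yet

      rdd.count()                           // first action: computes and caches partitions
      val afterFirstAction = calls.sum

      rdd.count()                           // second action: can read cached partitions
      val afterSecondAction = calls.sum

      (afterPersist, afterFirstAction, afterSecondAction)
    } finally {
      spark.stop()
    }
  }
}
```

With a dataset this small, which fits comfortably in memory, one would expect the call count to stay the same between the first and second action. If partitions did not fit or were evicted, the second action would recompute them and the count would grow.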
"Either way I go, eagerly or lazily, it is going to persist the entire RDD as per the storage level."
That is not exactly true. The thing is, persist is not persistent. As clearly stated in the documentation for the MEMORY_ONLY persistence level:
"If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed."
With MEMORY_AND_DISK, the remaining data is stored on disk, but it can still be evicted if there is not enough memory for subsequent caching. What is even more important:
"Spark automatically monitors cache usage on each node and drops out old data partitions in a least-recently-used (LRU) fashion."
You can also argue that cache / persist is semantically different from Spark actions, which are executed for specific IO side effects. cache is more of a hint to the Spark engine that we may want to reuse this piece of code later.
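The "hint" nature of cache can be seen in how little it changes: as a rough sketch (the object name CacheIsAHint is hypothetical, and a local master is assumed), cache() on an RDD only sets its storage level to MEMORY_ONLY, and unpersist() clears that marker again. No job needs to run for either call.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Hypothetical demo object; assumes a local Spark installation on the classpath.
object CacheIsAHint {
  // Returns whether the storage level was NONE before cache(),
  // MEMORY_ONLY after cache(), and NONE again after unpersist().
  def run(): (Boolean, Boolean, Boolean) = {
    val spark = SparkSession.builder()
      .master("local[2]")              // assumption: run locally with two threads
      .appName("cache-is-a-hint")
      .getOrCreate()
    val sc = spark.sparkContext
    try {
      val rdd = sc.parallelize(1 to 10)
      val unmarked = rdd.getStorageLevel == StorageLevel.NONE

      rdd.cache()                      // for RDDs, same as persist(StorageLevel.MEMORY_ONLY)
      val marked = rdd.getStorageLevel == StorageLevel.MEMORY_ONLY

      rdd.unpersist()                  // drops the marker and any cached blocks
      val cleared = rdd.getStorageLevel == StorageLevel.NONE

      (unmarked, marked, cleared)
    } finally {
      spark.stop()
    }
  }
}
```

Note that no action is called anywhere in this sketch, so nothing is ever computed or stored; only the RDD's metadata changes.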