Problem Description
Recently I saw some strange behaviour of Spark.
I have a pipeline in my application in which I'm manipulating one big Dataset - pseudocode:
val data = spark.read(...)
  .join(df1, "key") // etc., more transformations
data.cache() // cached so the data is not recalculated after the save below
data.write.parquet(...) // some save

val extension = data.join(...) // more transformations - joins, selects, etc.
extension.cache() // again, cache to avoid recomputation
extension.count()
// (1)
extension.write.csv(...) // some other save
extension.groupBy("key").agg(...) // some aggregations
  .write.parquet(...) // another save; without the cache this would trigger recomputation of the whole dataset
However, when I call data.unpersist(), i.e. at place (1), Spark deletes all the datasets from Storage, including the extension Dataset, which is not the dataset I tried to unpersist.
Is that expected behaviour? How can I free some memory by calling unpersist on the old Dataset without unpersisting every Dataset that was "next in the chain"?
My setup:
- Spark version: current master, RC for 2.3
- Scala: 2.11
- Java: OpenJDK 1.8
The question looks similar to Understanding Spark's caching, but here I perform some actions before the unpersist. First I count everything and then save it to storage - I don't know whether caching works the same way for RDDs as it does for Datasets.
Recommended Answer
This is expected behaviour of Spark caching. Spark does not want to keep invalid cached data, so it completely removes all cached plans that refer to the dataset.
This is done to make sure the query stays correct. In the example, you are creating the extension Dataset from the cached Dataset data. Now, if the Dataset data is unpersisted, the extension Dataset can no longer rely on the cached Dataset data.
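For reference, the cascade can be observed directly via Dataset.storageLevel. Below is a minimal, self-contained sketch (the session setup, DataFrame names and transformations are illustrative, not taken from the question) which, on a Spark 2.3 local session, shows the derived extension losing its cache entry as soon as data is unpersisted:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.storage.StorageLevel

object UnpersistCascadeDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("unpersist-demo").getOrCreate()

    val data = spark.range(100).toDF("key")
    data.cache()
    data.count() // materialize the first cache

    val extension = data.withColumn("doubled", col("key") * 2) // derived from the cached data
    extension.cache()
    extension.count() // materialize the second cache

    println(extension.storageLevel != StorageLevel.NONE) // true - extension is cached

    data.unpersist() // unpersist only the upstream Dataset, as at place (1)

    // On Spark 2.3 the cached plan of extension is invalidated as well, so this prints false.
    println(extension.storageLevel != StorageLevel.NONE)

    spark.stop()
  }
}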
Here is the pull request for the fix they made. You can also see the similar JIRA ticket.
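The accepted answer does not suggest a workaround, but since the question asks how to drop the old cache without losing the downstream one, one possible approach - an assumption on my part that merely builds on the save already present in the pipeline, not something stated in the answer - is to break the lineage by re-reading extension from its parquet output before unpersisting data:

// Hypothetical workaround sketch, not part of the accepted answer.
// Re-reading the already-saved result gives a plan that no longer refers
// to the cached `data`, so the cascading invalidation cannot touch it.
extension.write.parquet("/tmp/extension") // output path is illustrative
val extensionFromDisk = spark.read.parquet("/tmp/extension")
extensionFromDisk.cache()
extensionFromDisk.count() // materialize the new, independent cache

data.unpersist() // drops only the old cache; extensionFromDisk stays cached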