Question
From the official Spark documentation:
Spark SQL can cache tables using an in-memory columnar format by calling sqlContext.cacheTable("tableName") or dataFrame.cache(). Then Spark SQL will scan only required columns and will automatically tune compression to minimize memory usage and GC pressure. You can call sqlContext.uncacheTable("tableName") to remove the table from memory.
What does caching tables using an in-memory columnar format really mean? Does it put the whole table into memory? As we know, the cache is also lazy: the table is cached only after the first action on a query. Does choosing different actions or queries make any difference to the cached table? I've googled this caching topic several times but failed to find any detailed articles. I would really appreciate it if anyone could provide some links or articles on this topic.
http://spark.apache.org/docs/latest/sql-programming-guide.html#caching-data-in-memory
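For reference, a minimal sketch of the calls the quoted passage describes, assuming a Spark 1.x-style SQLContext and a small throwaway DataFrame registered as "tableName" (all names here are illustrative, not from the documentation):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object CacheTableSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("cache-sketch").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // A tiny illustrative DataFrame registered as a temp table.
    val df = sc.parallelize(Seq((1, "a"), (2, "b"))).toDF("id", "value")
    df.registerTempTable("tableName")

    // Mark the table for caching in the in-memory columnar format.
    // This is lazy: nothing is materialized until an action runs.
    sqlContext.cacheTable("tableName")

    // The first action materializes the cache; only the columns the
    // query needs are scanned.
    sqlContext.sql("SELECT id FROM tableName").count()

    // Remove the table's cached data from memory.
    sqlContext.uncacheTable("tableName")

    sc.stop()
  }
}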
Answer
Yes, caching a table puts the whole table in memory, compressed, if you use this setting: spark.sql.inMemoryColumnarStorage.compressed = true. Keep in mind that caching a DataFrame is lazy, which means it will only cache the rows that are used in the next processing event. So if you run a query on that DataFrame and only scan 100 rows, only those will be cached, not the entire table. If you execute CACHE TABLE MyTableName in SQL, however, it defaults to eager caching and will cache the entire table. You can opt into LAZY caching in SQL like so:
CACHE LAZY TABLE MyTableName
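A short sketch contrasting the two behaviors described above, under the same assumptions (a Spark 1.x-style SQLContext and an illustrative table named MyTableName; the compression setting shown is already true by default):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object EagerVsLazyCaching {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("eager-vs-lazy").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Compress the cached columnar data (true is already the default).
    sqlContext.setConf("spark.sql.inMemoryColumnarStorage.compressed", "true")

    val df = sc.parallelize(1 to 1000).map(i => (i, s"row$i")).toDF("id", "value")
    df.registerTempTable("MyTableName")

    // Lazy: cache() only marks the DataFrame; data is materialized
    // partition by partition as actions compute it, so a narrow query
    // caches only what it actually touches.
    df.cache()
    df.limit(100).count()
    df.unpersist() // drop the DataFrame-level cache before the SQL demo

    // Eager: this SQL statement scans and caches the whole table right away.
    sqlContext.sql("CACHE TABLE MyTableName")
    sqlContext.sql("UNCACHE TABLE MyTableName")

    // Lazy variant in SQL, matching the DataFrame behavior above.
    sqlContext.sql("CACHE LAZY TABLE MyTableName")

    sc.stop()
  }
}

UNCACHE TABLE (or sqlContext.uncacheTable) frees the memory again, as the documentation quote in the question mentions.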