Problem Description
From the official Spark documentation (linked below), it says:
What does caching tables using an in-memory columnar format really mean? Does it put the whole table into memory? We know that caching is also lazy: the table is cached only after the first action on the query. Does it make any difference to the cached table if different actions or queries are chosen? I've googled this caching topic several times but failed to find any detailed articles. I would really appreciate it if anyone could provide some links or articles on this topic.
http://spark.apache.org/docs/latest/sql-programming-guide.html#caching-data-in-memory
Recommended Answer
Yes, caching a table puts the whole table in memory in compressed form if you use this setting: spark.sql.inMemoryColumnarStorage.compressed = true. Keep in mind that caching a DataFrame is lazy, which means Spark only caches the rows that are used in the next processing event. So if you run a query on that DataFrame and only scan 100 rows, only those rows will be cached, not the entire table. If you run CACHE TABLE MyTableName in SQL, however, it defaults to eager caching and will cache the entire table. You can choose LAZY caching in SQL like so:
CACHE LAZY TABLE MyTableName
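
To make the lazy-versus-eager distinction concrete, here is a minimal Scala sketch, assuming Spark 2.x or later. The Parquet path /data/my_table and the object name CacheTableExample are made up for illustration; MyTableName matches the table name used above.

import org.apache.spark.sql.SparkSession

object CacheTableExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CacheTableExample")
      .master("local[*]")
      // Keep cached columnar batches compressed in memory (true is already the default).
      .config("spark.sql.inMemoryColumnarStorage.compressed", "true")
      .getOrCreate()

    // Hypothetical source data; any DataFrame behaves the same way.
    val df = spark.read.parquet("/data/my_table")
    df.createOrReplaceTempView("MyTableName")

    // Lazy caching: cache() only marks the plan for caching. The data is
    // materialized in the in-memory columnar format when the next action
    // scans it, and only what that action actually reads gets cached.
    df.cache()
    df.limit(100).count()

    // Eager caching: this statement scans and caches the whole table right away.
    spark.sql("CACHE TABLE MyTableName")

    // Lazy caching in SQL, matching the behavior of DataFrame.cache():
    spark.sql("CACHE LAZY TABLE MyTableName")

    spark.stop()
  }
}

You can call spark.catalog.isCached("MyTableName") or check the Storage tab of the Spark UI to confirm what has actually been cached.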