Question
Is a table registered with registerTempTable (createOrReplaceTempView in Spark 2.+) cached?
Using Zeppelin, I register a DataFrame in my Scala code after some heavy computation, and then within a %pyspark paragraph I want to access it and filter it further.
Will it use a memory-cached version of the table, or will it be rebuilt each time?
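For reference, the setup looks roughly like this (an illustrative sketch; the input path, column names, and the view name my_table are hypothetical):

// Scala paragraph in Zeppelin: heavy computation, then register the result by name
val result = spark.read.parquet("/data/events")   // hypothetical source
  .groupBy("user_id")
  .count()
result.createOrReplaceTempView("my_table")        // registerTempTable("my_table") before Spark 2.0

// A later %pyspark paragraph can then look the view up by name, e.g.
//   spark.table("my_table").filter("count > 10")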
Answer
Registered tables are not cached in memory.
The createOrReplaceTempView method just creates or replaces a view of the given DataFrame with its query plan.
Only when a permanent view needs to be created does Spark convert the query plan to a canonicalized SQL string and store it as the view text in the metastore.
You need to cache your DataFrame explicitly, e.g.:
df.createOrReplaceTempView("my_table")  # df.registerTempTable("my_table") for Spark < 2.0
spark.catalog.cacheTable("my_table")    # sqlContext.cacheTable("my_table") for Spark < 2.0
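Note that cacheTable is lazy: the table is only materialized in memory the first time it is actually scanned. A quick way to see this from the Scala shell (same hypothetical table name as above; the calls work the same from %pyspark):

spark.catalog.isCached("my_table")   // true: the table is marked for in-memory caching
spark.table("my_table").count()      // the first action scans the table and materializes the cache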
Let's illustrate this with an example:
Using cacheTable:
scala> val df = Seq(("1",2),("b",3)).toDF
// df: org.apache.spark.sql.DataFrame = [_1: string, _2: int]
scala> sc.getPersistentRDDs
// res0: scala.collection.Map[Int,org.apache.spark.rdd.RDD[_]] = Map()
scala> df.createOrReplaceTempView("my_table")
scala> sc.getPersistentRDDs
// res2: scala.collection.Map[Int,org.apache.spark.rdd.RDD[_]] = Map()
scala> spark.catalog.cacheTable("my_table") // sqlContext.cacheTable("my_table") before Spark 2.0
scala> sc.getPersistentRDDs
// res4: scala.collection.Map[Int,org.apache.spark.rdd.RDD[_]] = Map(2 -> In-memory table my_table MapPartitionsRDD[2] at cacheTable at <console>:26)
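To free the memory again, the cached table can be dropped through the catalog API (a small sketch reusing the table name above):

spark.catalog.uncacheTable("my_table")   // drops the in-memory table
sc.getPersistentRDDs                     // the entry should disappear, leaving an empty Map again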
Now the same example, using cache.createOrReplaceTempView:
scala> sc.getPersistentRDDs
// res2: scala.collection.Map[Int,org.apache.spark.rdd.RDD[_]] = Map()
scala> val df = Seq(("1",2),("b",3)).toDF
// df: org.apache.spark.sql.DataFrame = [_1: string, _2: int]
scala> df.createOrReplaceTempView("my_table")
scala> sc.getPersistentRDDs
// res4: scala.collection.Map[Int,org.apache.spark.rdd.RDD[_]] = Map()
scala> df.cache.createOrReplaceTempView("my_table")
scala> sc.getPersistentRDDs
// res6: scala.collection.Map[Int,org.apache.spark.rdd.RDD[_]] =
// Map(2 -> ConvertToUnsafe
// +- LocalTableScan [_1#0,_2#1], [[1,2],[b,3]]
// MapPartitionsRDD[2] at cache at <console>:28)
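In this second variant the cached data belongs to the DataFrame itself, and the view merely reuses its plan, so the memory is released on the DataFrame (again a sketch with the names above):

df.unpersist()   // releases the cached data; the view "my_table" stays registered but is no longer cached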