Question
Is a table registered with registerTempTable (createOrReplaceTempView in Spark 2.+) cached?
Using Zeppelin, I register a DataFrame in my Scala code after some heavy computation, and then within a %pyspark paragraph I want to access it and filter it further.
Will it use a memory-cached version of the table, or will it be rebuilt each time?
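For reference, the setup looks roughly like this (an illustrative sketch; the input path, column names, and the view name my_table are hypothetical):

// Scala paragraph in Zeppelin: heavy computation, then register the result by name
val result = spark.read.parquet("/data/events")   // hypothetical source
  .groupBy("user_id")
  .count()
result.createOrReplaceTempView("my_table")        // registerTempTable("my_table") before Spark 2.0

// A later %pyspark paragraph can then look the view up by name, e.g.
//   spark.table("my_table").filter("count > 10")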
Answer
Registered tables are not cached in memory.
The createOrReplaceTempView method just creates or replaces a view of the given DataFrame with its query plan.
Only when a permanent view needs to be created does Spark convert the query plan to a canonicalized SQL string and store it as the view text in the metastore.
You need to cache your DataFrame explicitly, e.g.:
df.createOrReplaceTempView("my_table")  # df.registerTempTable("my_table") for Spark < 2.0
spark.catalog.cacheTable("my_table")    # sqlContext.cacheTable("my_table") for Spark < 2.0
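Note that cacheTable is lazy: the table is only materialized in memory the first time it is actually scanned. A quick way to see this from the Scala shell (same hypothetical table name as above; the calls work the same from %pyspark):

spark.catalog.isCached("my_table")   // true: the table is marked for in-memory caching
spark.table("my_table").count()      // the first action scans the table and materializes the cache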
Let's illustrate this with an example:
Using cacheTable:
scala> val df = Seq(("1",2),("b",3)).toDF
// df: org.apache.spark.sql.DataFrame = [_1: string, _2: int]
scala> sc.getPersistentRDDs
// res0: scala.collection.Map[Int,org.apache.spark.rdd.RDD[_]] = Map()
scala> df.createOrReplaceTempView("my_table")
scala> sc.getPersistentRDDs
// res2: scala.collection.Map[Int,org.apache.spark.rdd.RDD[_]] = Map()
scala> spark.catalog.cacheTable("my_table") // sqlContext.cacheTable("my_table") before Spark 2.0
scala> sc.getPersistentRDDs
// res4: scala.collection.Map[Int,org.apache.spark.rdd.RDD[_]] = Map(2 -> In-memory table my_table MapPartitionsRDD[2] at cacheTable at <console>:26)
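To free the memory again, the cached table can be dropped through the catalog API (a small sketch reusing the table name above):

spark.catalog.uncacheTable("my_table")   // drops the in-memory table
sc.getPersistentRDDs                     // the entry should disappear, leaving an empty Map again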
Now the same example, using cache.createOrReplaceTempView:
scala> sc.getPersistentRDDs
// res2: scala.collection.Map[Int,org.apache.spark.rdd.RDD[_]] = Map()
scala> val df = Seq(("1",2),("b",3)).toDF
// df: org.apache.spark.sql.DataFrame = [_1: string, _2: int]
scala> df.createOrReplaceTempView("my_table")
scala> sc.getPersistentRDDs
// res4: scala.collection.Map[Int,org.apache.spark.rdd.RDD[_]] = Map()
scala> df.cache.createOrReplaceTempView("my_table")
scala> sc.getPersistentRDDs
// res6: scala.collection.Map[Int,org.apache.spark.rdd.RDD[_]] =
// Map(2 -> ConvertToUnsafe
// +- LocalTableScan [_1#0,_2#1], [[1,2],[b,3]]
// MapPartitionsRDD[2] at cache at <console>:28)
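In this second variant the cached data belongs to the DataFrame itself, and the view merely reuses its plan, so the memory is released on the DataFrame (again a sketch with the names above):

df.unpersist()   // releases the cached data; the view "my_table" stays registered but is no longer cached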