
Problem description

I am running this query in the Spark shell, but it gives me an error:

sqlContext.sql(
 "select sal from samplecsv where sal < (select MAX(sal) from samplecsv)"
).collect().foreach(println)

Error:

select sal from samplecsv where sal < (select MAX(sal) from samplecsv)
^
at scala.sys.package$.error(package.scala:27)

Can anybody explain this to me? Thanks.

Recommended answer

Planned features:

  • SPARK-23945 (Column.isin() should accept a single-column DataFrame as input).
  • SPARK-18455 (General support for correlated subquery processing).

Spark 2.0+

Spark SQL should support both correlated and uncorrelated subqueries; see SubquerySuite for details. Some examples include:

select * from l where exists (select * from r where l.a = r.c)
select * from l where not exists (select * from r where l.a = r.c)

select * from l where l.a in (select c from r)
select * from l where a not in (select c from r)
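
This means the query from the question runs as-is on Spark 2.0+. A minimal self-contained sketch (the sample data, app name, and the name/sal columns are illustrative, following the question's samplecsv table):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("subquery-example")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Illustrative data standing in for the samplecsv table from the question.
Seq(("a", 1000), ("b", 2000), ("c", 3000))
  .toDF("name", "sal")
  .createOrReplaceTempView("samplecsv")

// Uncorrelated scalar subquery in WHERE -- accepted on Spark 2.0+.
spark.sql(
  "select sal from samplecsv where sal < (select max(sal) from samplecsv)"
).collect().foreach(println)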

Unfortunately, as of now (Spark 2.0), it is impossible to express the same logic using the DataFrame DSL.
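
A common workaround in the DSL is to compute the scalar aggregate first and then filter on the result. A minimal sketch, assuming df is a DataFrame over the samplecsv data with an integer sal column:

import org.apache.spark.sql.functions.max

// Pull the scalar out with a separate aggregation, then filter on it.
// Assumes sal is IntegerType; use getLong/getDouble for other types.
val maxSal = df.agg(max("sal")).first().getInt(0)
df.where(df("sal") < maxSal).show()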

Spark < 2.0

Spark supports subqueries in the FROM clause (same as Hive <= 0.12):

SELECT col FROM (SELECT * FROM t1 WHERE bar) t2

It simply doesn't support subqueries in the WHERE clause. Generally speaking, arbitrary subqueries (in particular correlated subqueries) cannot be expressed using Spark without promoting them to a Cartesian join.

Since subquery performance is usually a significant issue in a typical relational system, and every subquery can be expressed using a JOIN, there is no loss of functionality here.
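
For the question's query, that means rewriting the scalar subquery as a join against a one-row aggregate. A sketch for Spark < 2.0 (the aliases t, m, and max_sal are illustrative names); note that Spark executes the non-equi join as a Cartesian product plus a filter, which is exactly the promotion mentioned above:

sqlContext.sql("""
  SELECT t.sal
  FROM samplecsv t
  JOIN (SELECT MAX(sal) AS max_sal FROM samplecsv) m
  ON t.sal < m.max_sal
""").collect().foreach(println)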
