This post looks at a question about Spark cross joins: two very similar pieces of code, one of which works while the other does not.

Problem Description

I have the following code:

import org.apache.spark.sql.functions.lit
import spark.implicits._  // needed for toDF and the $"..." column syntax

val ori0 = Seq(
  (0L, "1")
).toDF("id", "col1")
val date0 = Seq(
  (0L, "1")
).toDF("id", "date")

val joinExpression = $"col1" === $"date"
ori0.join(date0, joinExpression).show()

val ori = spark.range(1).withColumn("col1", lit("1"))
val date = spark.range(1).withColumn("date", lit("1"))
ori.join(date, joinExpression).show()

The first join works, but the second fails with an error:

Exception in thread "main" org.apache.spark.sql.AnalysisException: Detected implicit cartesian product for INNER join between logical plans
Range (0, 1, step=1, splits=Some(4))
and
Project [_1#11L AS id#14L, _2#12 AS date#15]
+- Filter (isnotnull(_2#12) && (1 = _2#12))
   +- LocalRelation [_1#11L, _2#12]
Join condition is missing or trivial.

I have looked at it many times and I still do not see why it is a cross join. What is the difference between the two?

Recommended Answer

If you were to expand the second join, you'd see that it is really equivalent to:

SELECT *
FROM ori JOIN date
WHERE 1 = 1

Clearly the WHERE 1 = 1 join condition is trivial, which is one of the conditions under which Spark detects a Cartesian product.
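
If the Cartesian product is actually what you want, Spark provides two explicit ways to get it. A minimal sketch, assuming a spark-shell session (so spark and its implicits are in scope); note that spark.sql.crossJoin.enabled is the Spark 2.x setting and defaults to true in Spark 3.x:

import org.apache.spark.sql.functions.lit
import spark.implicits._

val ori = spark.range(1).withColumn("col1", lit("1"))
val date = spark.range(1).withColumn("date", lit("1"))

// Option 1: request the Cartesian product explicitly, then filter.
ori.crossJoin(date).where($"col1" === $"date").show()

// Option 2: let the planner fall back to a Cartesian product
// when it decides the join condition is trivial.
spark.conf.set("spark.sql.crossJoin.enabled", "true")
ori.join(date, $"col1" === $"date").show()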

This is not the case for the first join: at that point the optimizer cannot infer that the join columns contain only a single value, so it will attempt to apply a hash or sort-merge join.
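
You can see the difference by printing the query plans. A minimal sketch, assuming the ori0/date0 DataFrames and joinExpression from the question are still in scope; the exact plan text varies by Spark version:

// The literals in ori0/date0 come from a LocalRelation, so the optimizer
// keeps the equality predicate and plans a regular equi-join (with inputs
// this small, typically a broadcast hash join).
ori0.join(date0, joinExpression).explain(true)

// The second pair is built from Range plus lit("1"); constant folding
// reduces col1 = date to a trivial condition, so planning is rejected
// with the AnalysisException above unless cross joins are enabled.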
