How to prevent predicate pushdown?

Question

Recently I was working with Spark and a JDBC data source. Consider the following snippet:

val df = spark.read.options(options).format("jdbc").load()
val newDF = df.where(PRED)

PRED is a list of predicates.

If PRED is a simple predicate, such as x = 10, the query is much faster. However, if it contains non-equi conditions such as date > someOtherDate or date < someOtherDate2, the query is much slower than it is without predicate pushdown. As you may know, database engines can be very slow at scanning on such predicates; in my case the query was as much as 10 times slower (!).

To prevent the unwanted predicate pushdown I used:

val cachedDF = df.cache()
val newDF = cachedDF.where(PRED)

But it requires a lot of memory and, due to the problem mentioned here (Spark's Dataset unpersist behaviour), I can't unpersist cachedDF.

Is there any other option to avoid pushing down predicates, without caching and without writing my own data source?

Note: Even if there is an option to turn off predicate pushdown, it is useful to me only if other queries can still use pushdown. So, if I wrote:

// some fancy option set to not push down predicates
val df1 = ...
// predicate pushdown works again
val df2 = ...
df1.join(df2) // df1 is read without predicate pushdown, but df2 with it

Recommended answer

A JIRA ticket has been opened for this issue. You can follow it here: https://issues.apache.org/jira/browse/SPARK-24288
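For readers on a newer Spark version, here is a minimal sketch of how a per-read switch could look. It assumes Spark 2.4 or later, where the JDBC data source documents a pushDownPredicate option (default true). To my understanding this is the change tracked by the ticket above, but please verify against the documentation for your version. The connection URL, table name, credentials, and the event_date column are hypothetical placeholders, not taken from the question.

```scala
import org.apache.spark.sql.SparkSession

object PushdownSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("pushdown-sketch")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical connection settings; replace with your own.
    val options = Map(
      "url"      -> "jdbc:postgresql://localhost:5432/mydb",
      "dbtable"  -> "events",
      "user"     -> "spark",
      "password" -> "secret"
    )

    // df1: predicate pushdown disabled for this read only
    // (assumes the Spark 2.4+ JDBC option "pushDownPredicate").
    val df1 = spark.read
      .format("jdbc")
      .options(options)
      .option("pushDownPredicate", "false")
      .load()

    // df2: default behaviour, filters may still be pushed to the database.
    val df2 = spark.read
      .format("jdbc")
      .options(options)
      .load()

    // Range filter that should now be evaluated by Spark, not the database.
    val filtered1 = df1.where("event_date > '2018-01-01'")
    // Equality filter that remains eligible for pushdown.
    val filtered2 = df2.where("x = 10")

    // Compare the PushedFilters entry of the two scan nodes in the plans.
    filtered1.explain()
    filtered2.explain()

    spark.stop()
  }
}
```

Because the option is set on the reader rather than globally, it applies only to the DataFrame it was loaded with, which matches the df1/df2 scenario in the note: one side of a join can skip pushdown while the other keeps it.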
