Problem description
Let's say I have a set of items:
- Item1
- Item2
- Item3
- Item4
- Item5
A query can be constructed in two ways. Firstly:
SELECT *
FROM TABLE
WHERE ITEM NOT IN ('item1', 'item2', 'item3', 'item4','item5')
Or, it can be written as:
SELECT *
FROM TABLE
WHERE ITEM != 'item1'
AND ITEM != 'item2'
AND ITEM != 'item3'
AND ITEM != 'item4'
AND ITEM != 'item5'
- Which is more efficient, and why?
- At what point does one become more efficient than the other? In other words, what if there were 500 items?
My question is specifically relating to PostgreSQL.
Recommended answer
In PostgreSQL there's usually a fairly small difference at reasonable list lengths, though NOT IN is much cleaner conceptually. Very long AND ... <> ... lists and very long NOT IN lists both perform terribly, with AND much worse than NOT IN.
In both cases, if the lists are long enough for you to even be asking the question, you should be doing an anti-join or subquery exclusion test over a value list instead.
WITH excluded(item) AS (
VALUES('item1'), ('item2'), ('item3'), ('item4'),('item5')
)
SELECT *
FROM thetable t
WHERE NOT EXISTS(SELECT 1 FROM excluded e WHERE t.item = e.item);
or:
WITH excluded(item) AS (
VALUES('item1'), ('item2'), ('item3'), ('item4'),('item5')
)
SELECT *
FROM thetable t
LEFT OUTER JOIN excluded e ON (t.item = e.item)
WHERE e.item IS NULL;
(On modern Pg versions both will produce the same query plan anyway).
If the value list is long enough (many tens of thousands of items) then query parsing may start having a significant cost. At this point you should consider creating a TEMPORARY table, COPYing the data to exclude into it, possibly creating an index on it, then using one of the above approaches on the temp table instead of the CTE.
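The temp-table route described above could look roughly like the sketch below. It is not taken from the answer: the table name excluded_items and the text column type are assumptions for illustration, and thetable is the placeholder table from the earlier examples. The inline COPY data form works when the script is run through psql.

```sql
-- Hypothetical temp table holding the values to exclude (name and type are assumptions).
CREATE TEMPORARY TABLE excluded_items (item text NOT NULL);

-- Bulk-load the exclusion values. In a psql script the rows follow the
-- COPY statement and are terminated by a line containing only \.
COPY excluded_items (item) FROM STDIN;
item1
item2
item3
item4
item5
\.

-- Optional: an index may help for large lists (the benchmark below saw no
-- difference at its data size).
CREATE INDEX ON excluded_items (item);

-- Temp tables are not analyzed by autovacuum, so collect statistics manually.
ANALYZE excluded_items;

-- Same anti-join shape as the CTE version, reading from the temp table instead.
SELECT t.*
FROM thetable t
WHERE NOT EXISTS (SELECT 1 FROM excluded_items e WHERE e.item = t.item);
```

From a client application, the same load would normally go through the driver's COPY support rather than inline data.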
Demo:
CREATE UNLOGGED TABLE exclude_test(id integer primary key);
INSERT INTO exclude_test(id) SELECT generate_series(1,50000);
CREATE TABLE exclude AS SELECT x AS item FROM generate_series(1,40000,4) x;
where exclude is the list of values to omit.
I then compare the following approaches on the same data with all results in milliseconds:
- NOT IN list: 3424.596
- AND ... list: 80173.823
- VALUES based JOIN exclusion: 20.727
- VALUES based subquery exclusion: 20.495
- Table-based JOIN, no index on ex-list: 25.183
- Subquery table based, no index on ex-list: 23.985
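For readers who want to reproduce the two table-based measurements, a minimal sketch against the demo tables (exclude_test and exclude) might look like the following. This is not the author's benchmark script, which lives in the linked gist, and timings will differ by machine and PostgreSQL version.

```sql
-- Table-based subquery exclusion: keep ids with no matching row in exclude.
EXPLAIN ANALYZE
SELECT t.id
FROM exclude_test t
WHERE NOT EXISTS (SELECT 1 FROM exclude e WHERE e.item = t.id);

-- Table-based join exclusion: left join, then keep only the unmatched rows.
EXPLAIN ANALYZE
SELECT t.id
FROM exclude_test t
LEFT OUTER JOIN exclude e ON (t.id = e.item)
WHERE e.item IS NULL;
```

EXPLAIN ANALYZE runs the query and reports actual execution time alongside the plan, so the two plans can be compared directly; in psql, \timing gives the client-side elapsed time as well.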
... making the CTE-based approach over three thousand times faster than the AND list and roughly 165 times faster than the NOT IN list.
Code here: https://gist.github.com/ringerc/5755247 (shield your eyes, ye who follow this link).
For this data set size adding an index on the exclusion list made no difference.
Notes:
- IN list generated with SELECT 'IN (' || string_agg(item::text, ',' ORDER BY item) || ')' from exclude;
- AND list generated with SELECT string_agg(item::text, ' AND item <> ') from exclude;
- Subquery and join based table exclusion were much the same across repeated runs.
- Examination of the plan shows that Pg translates NOT IN to <> ALL
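The NOT IN to <> ALL translation mentioned in the notes can be checked on a small list. The sketch below uses the demo table exclude_test with an arbitrary three-value list chosen for illustration; the exact plan text may vary between PostgreSQL versions.

```sql
-- Inspect how the planner represents a literal NOT IN list.
EXPLAIN
SELECT * FROM exclude_test WHERE id NOT IN (1, 5, 9);
-- The Filter line of the plan is expected to show the list as an array
-- comparison, something like: id <> ALL ('{1,5,9}'::integer[])

-- The same predicate written explicitly in the <> ALL form.
EXPLAIN
SELECT * FROM exclude_test WHERE id <> ALL (ARRAY[1, 5, 9]);
```

Either way the filter is evaluated against every candidate row, which is consistent with long lists scaling so poorly in the benchmark.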
So... you can see that there's a truly huge gap between both NOT IN and AND lists and doing a proper join. What surprised me was how fast doing it with a CTE using a VALUES list was: parsing the VALUES list took almost no time at all, and it performed the same as or slightly faster than the table approach in most tests.
It'd be nice if PostgreSQL could automatically recognise a preposterously long IN clause or chain of similar AND conditions and switch to a smarter approach like doing a hash join or implicitly turning it into a CTE node. Right now it doesn't know how to do that.
See also: this handy blog post Magnus Hagander wrote on the topic