SQL: When it comes to NOT IN and NOT EQUAL TO, which is more efficient and why?

Question

Let's say I have a set of items:

Item1
Item2
Item3
Item4
Item5

A query can be constructed in two ways. Firstly:

SELECT *
FROM TABLE
WHERE ITEM NOT IN ('item1', 'item2', 'item3', 'item4','item5')

Or it can be written as:

SELECT *
FROM TABLE
WHERE ITEM != 'item1'
  AND ITEM != 'item2'
  AND ITEM != 'item3'
  AND ITEM != 'item4'
  AND ITEM != 'item5'

Which is more efficient, and why?

At what point does one become more efficient than the other? In other words, what if there were 500 items?

My question is specifically relating to PostgreSQL.

Recommended answer

In PostgreSQL there's usually a fairly small difference at reasonable list lengths, though IN is much cleaner conceptually. Very long AND ... <> ... lists and very long NOT IN lists both perform terribly, with AND much worse than NOT IN.

In both cases, if they're long enough for you to even be asking the question you should be doing an anti-join or subquery exclusion test over a value list instead.

WITH excluded(item) AS (
    VALUES('item1'), ('item2'), ('item3'), ('item4'),('item5')
)
SELECT *
FROM thetable t
WHERE NOT EXISTS(SELECT 1 FROM excluded e WHERE t.item = e.item);

or:

WITH excluded(item) AS (
    VALUES('item1'), ('item2'), ('item3'), ('item4'),('item5')
)
SELECT *
FROM thetable t
LEFT OUTER JOIN excluded e ON (t.item = e.item)
WHERE e.item IS NULL;

(On modern Pg versions both will produce the same query plan anyway).

If the value list is long enough (many tens of thousands of items) then query parsing may start having a significant cost. At this point you should consider creating a TEMPORARY table, COPYing the data to exclude into it, possibly creating an index on it, then using one of the above approaches on the temp table instead of the CTE.
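The temporary-table approach described above might look roughly like the following psql session. This is a sketch under assumptions, not the answer author's code: the table name excluded_items and the file excluded_items.txt are hypothetical, and the item column is assumed to be text.

```sql
-- Sketch only: excluded_items and excluded_items.txt are hypothetical names.
BEGIN;

-- The temp table is private to this session and dropped at COMMIT.
CREATE TEMPORARY TABLE excluded_items (item text NOT NULL) ON COMMIT DROP;

-- Bulk-load the values to exclude, one per line.
-- \copy is a psql meta-command that reads a client-side file;
-- a server-side COPY ... FROM 'file' would need server file access instead.
\copy excluded_items (item) FROM 'excluded_items.txt'

-- Optional: an index may or may not help, depending on list size.
CREATE INDEX ON excluded_items (item);

-- Autovacuum does not process temp tables, so gather statistics manually.
ANALYZE excluded_items;

-- Same anti-join as before, but against the temp table instead of the CTE.
SELECT t.*
FROM thetable t
WHERE NOT EXISTS (SELECT 1 FROM excluded_items e WHERE e.item = t.item);

COMMIT;
```

Loading with COPY avoids parsing one enormous SQL statement, which is the cost this paragraph is concerned with.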

Demo:

CREATE UNLOGGED TABLE exclude_test(id integer primary key);
INSERT INTO exclude_test(id) SELECT generate_series(1,50000);
CREATE TABLE exclude AS SELECT x AS item FROM generate_series(1,40000,4) x;

where exclude is the list of values to exclude.

I then compare the following approaches on the same data with all results in milliseconds:

NOT IN list: 3424.596
AND ... list: 80173.823
VALUES based JOIN exclusion: 20.727
VALUES based subquery exclusion: 20.495
Table-based JOIN, no index on ex-list: 25.183
Subquery table based, no index on ex-list: 23.985

... making the CTE-based approach over three thousand times faster than the AND list and roughly 165 times faster than the NOT IN list.
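To reproduce one of these measurements against the demo tables, something like the following could be used. It is only a sketch of the table-based subquery exclusion, not the exact script from the benchmark, and timings will differ from the figures above depending on hardware and PostgreSQL version.

```sql
-- Times the table-based NOT EXISTS exclusion on the demo tables
-- (exclude_test.id and exclude.item, as created in the demo above).
EXPLAIN (ANALYZE, BUFFERS)
SELECT t.*
FROM exclude_test t
WHERE NOT EXISTS (SELECT 1 FROM exclude e WHERE e.item = t.id);
```

The plan output shows how the planner chose to execute the exclusion, along with the actual execution time, which is the number to compare across approaches.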

Code here: https://gist.github.com/ringerc/5755247 (shield your eyes, ye who follow this link).

For this data set size adding an index on the exclusion list made no difference.

Notes:

IN list generated with SELECT 'IN (' || string_agg(item::text, ',' ORDER BY item) || ')' from exclude;
AND list generated with SELECT string_agg(item::text, ' AND item <> ') from exclude;
Subquery and join based table exclusion were much the same across repeated runs.
Examination of the plan shows that Pg translates NOT IN to <> ALL
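The last note can be checked with a plain EXPLAIN on a short literal list. This is a small illustration against the demo table; the three values are arbitrary.

```sql
-- Inspect how the planner represents a NOT IN list of constants.
EXPLAIN
SELECT *
FROM exclude_test
WHERE id NOT IN (1, 5, 9);
-- The Filter line of the plan should resemble:
--   Filter: (id <> ALL ('{1,5,9}'::integer[]))
```

The filter is evaluated per row against the whole array, which is consistent with the poor scaling of very long NOT IN lists measured above.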

So... you can see that there's a truly huge gap between both IN and AND lists vs doing a proper join. What surprised me was how fast doing it with a CTE using a VALUES list was ... parsing the VALUES list took almost no time at all, performing the same or slightly faster than the table approach in most tests.

It'd be nice if PostgreSQL could automatically recognise a preposterously long IN clause or chain of similar AND conditions and switch to a smarter approach like doing a hashed join or implicitly turning it into a CTE node. Right now it doesn't know how to do that.

See also:

this handy blog post Magnus Hagander wrote on the topic
