Postgresql使用left join 优化 not in


场景:
查看在t1表中的数据,而这些数据不能出现在t2表。初学者容易想到 not in。这个问题可以使用left join 解决,如果两个表的结构一致,也可以采用差集来解决


使用例子

  1. 建立测试数据
    新建两个表t1, t2, 每个表有100万的数据。有999990是相同的。现在要找到在t1的数据,而不再t2的数据。

    create table t1(id int, info text);
    create table t2(id int, info text);
    
    insert into t1(id, info) select i, md5(i::text) from generate_series(1, 1000000) t(i);
    insert into t2(id, info) select i, md5(i::text) from generate_series(11, 1000000 + 10) t(i);
    
    create index on t1(id);
    create index on t2(id);
    
    postgres=# select count(*) from t1;
      count
    ---------
     1000000
    (1 row)
    
    postgres=# select count(*) from t2;
      count
    ---------
     1000000
    (1 row) 
  2. 使用not in 效果

    postgres=# explain  select * from t1 where id not in (select id from t2);
                                   QUERY PLAN
    --------------------------------------------------------------------------------
     Gather  (cost=1000.00..6196106209.00 rows=500000 width=37)
       Workers Planned: 2
       ->  Parallel Seq Scan on t1  (cost=0.00..6196055209.00 rows=208333 width=37)
             Filter: (NOT (SubPlan 1))
             SubPlan 1
               ->  Materialize  (cost=0.00..27241.00 rows=1000000 width=4)
                     ->  Seq Scan on t2  (cost=0.00..18334.00 rows=1000000 width=4)
    (7 rows) 
  3. 使用left join 的效果

    postgres=# explain  select t1.* from t1 left join t2 on t1.id = t2.id where t2.* is  null;
                                      QUERY PLAN
    ---------------------------------------------------------------------------------------
     Gather  (cost=23592.00..51286.28 rows=5000 width=37)
       Workers Planned: 2
       ->  Parallel Hash Left Join  (cost=22592.00..49786.28 rows=2083 width=37)
             Hash Cond: (t1.id = t2.id)
             Filter: (t2.* IS NULL)
             ->  Parallel Seq Scan on t1  (cost=0.00..12500.67 rows=416667 width=37)
             ->  Parallel Hash  (cost=12500.67..12500.67 rows=416667 width=65)
                   ->  Parallel Seq Scan on t2  (cost=0.00..12500.67 rows=416667 width=65)
    (8 rows) 

对比结果
6196106209 / 51286 = 120814
基本上相差12万倍

10-03 05:57