Postgresql使用left join 优化 not in
场景:
查看在t1表中的数据,而这些数据不能出现在t2表。初学者容易想到 not in。这个问题可以使用left join 解决,如果两个表的结构一致,也可以采用差集来解决
使用例子
建立测试数据
新建两个表t1, t2, 每个表有100万的数据。有999990是相同的。现在要找到在t1的数据,而不再t2的数据。create table t1(id int, info text); create table t2(id int, info text); insert into t1(id, info) select i, md5(i::text) from generate_series(1, 1000000) t(i); insert into t2(id, info) select i, md5(i::text) from generate_series(11, 1000000 + 10) t(i); create index on t1(id); create index on t2(id); postgres=# select count(*) from t1; count --------- 1000000 (1 row) postgres=# select count(*) from t2; count --------- 1000000 (1 row)
使用not in 效果
postgres=# explain select * from t1 where id not in (select id from t2); QUERY PLAN -------------------------------------------------------------------------------- Gather (cost=1000.00..6196106209.00 rows=500000 width=37) Workers Planned: 2 -> Parallel Seq Scan on t1 (cost=0.00..6196055209.00 rows=208333 width=37) Filter: (NOT (SubPlan 1)) SubPlan 1 -> Materialize (cost=0.00..27241.00 rows=1000000 width=4) -> Seq Scan on t2 (cost=0.00..18334.00 rows=1000000 width=4) (7 rows)
使用left join 的效果
postgres=# explain select t1.* from t1 left join t2 on t1.id = t2.id where t2.* is null; QUERY PLAN --------------------------------------------------------------------------------------- Gather (cost=23592.00..51286.28 rows=5000 width=37) Workers Planned: 2 -> Parallel Hash Left Join (cost=22592.00..49786.28 rows=2083 width=37) Hash Cond: (t1.id = t2.id) Filter: (t2.* IS NULL) -> Parallel Seq Scan on t1 (cost=0.00..12500.67 rows=416667 width=37) -> Parallel Hash (cost=12500.67..12500.67 rows=416667 width=65) -> Parallel Seq Scan on t2 (cost=0.00..12500.67 rows=416667 width=65) (8 rows)
对比结果
6196106209 / 51286 = 120814
基本上相差12万倍