这是我的问题的一个简单示例:
CREATE TABLE test1 (id SERIAL, key TEXT UNIQUE, value TEXT);
CREATE TABLE test2 (id SERIAL, key TEXT UNIQUE, value TEXT);
INSERT INTO test1 (key, value)
SELECT i::TEXT, 'ABC' || i::TEXT
FROM generate_series(0, 1000000) AS i;
INSERT INTO test2 (key, value)
SELECT i::TEXT, 'ABC' || (i+1000)::TEXT
FROM generate_series(0, 600000) AS i;
INSERT INTO test2 (key, value)
SELECT i::TEXT, 'ABC' || (i+1000)::TEXT
FROM generate_series(1000000, 1200000) AS i;
CREATE INDEX test1_key ON test1 (key);
CREATE INDEX test1_value ON test1 (value);
CREATE INDEX test2_key ON test2 (key);
CREATE INDEX test2_value ON test2 (value);
VACUUM FULL VERBOSE ANALYZE test1;
VACUUM FULL VERBOSE ANALYZE test2;
那是我当前正在使用的查询,但是需要6秒钟以上。
EXPLAIN ANALYZE
SELECT test1.key AS key1, test1.value AS value1,
test2.key AS key2, test2.value AS value2
FROM test1
LEFT OUTER JOIN test2 ON (test1.key = test2.key)
WHERE test1.value = 'ABC1234' OR test2.value = 'ABC1234';
key1 | value1 | key2 | value2
------+---------+------+---------
234 | ABC234 | 234 | ABC1234
1234 | ABC1234 | 1234 | ABC2234
(2 rows)
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Hash Left Join (cost=27344.05..79728.10 rows=2 width=32) (actual time=5428.635..6097.098 rows=2 loops=1)
Hash Cond: (test1.key = test2.key)
Filter: ((test1.value = 'ABC1234'::text) OR (test2.value = 'ABC1234'::text))
-> Seq Scan on test1 (cost=0.00..16321.01 rows=1000001 width=15) (actual time=0.009..1057.315 rows=1000001 loops=1)
-> Hash (cost=13047.02..13047.02 rows=800002 width=17) (actual time=2231.964..2231.964 rows=800002 loops=1)
Buckets: 65536 Batches: 2 Memory Usage: 14551kB
-> Seq Scan on test2 (cost=0.00..13047.02 rows=800002 width=17) (actual time=0.010..980.232 rows=800002 loops=1)
Total runtime: 6109.042 ms
(8 rows)
在两个表中,只有极少数的数据集可以满足要求,但是似乎没有观察到事实。我可以改用这样的查询:
EXPLAIN ANALYZE
SELECT coalesce(test1.key, test3.key1) AS key1, coalesce(test1.value, test3.value1) AS value1,
coalesce(test2.key, test3.key2) AS key2, coalesce(test2.value, test3.value2) AS value2
FROM (SELECT test1.key AS key1, test1.value AS value1,
test2.key AS key2, test2.value AS value2
FROM (SELECT key, value FROM test1 WHERE value = 'ABC1234') AS test1
FULL JOIN (SELECT key, value FROM test2 WHERE value = 'ABC1234') AS test2
ON (test1.key = test2.key)) AS test3
LEFT OUTER JOIN test1 ON (test1.key = test3.key2)
LEFT OUTER JOIN test2 ON (test2.key = test3.key1)
WHERE test1.key IS NOT NULL;
key1 | value1 | key2 | value2
------+---------+------+---------
1234 | ABC1234 | 1234 | ABC2234
234 | ABC234 | 234 | ABC1234
(2 rows)
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
Nested Loop Left Join (cost=0.00..33.56 rows=1 width=64) (actual time=0.075..0.083 rows=1 loops=1)
-> Nested Loop (cost=0.00..25.19 rows=1 width=47) (actual time=0.066..0.072 rows=1 loops=1)
-> Nested Loop Left Join (cost=0.00..16.80 rows=1 width=32) (actual time=0.051..0.054 rows=1 loops=1)
-> Index Scan using test2_value_key on test2 (cost=0.00..8.41 rows=1 width=17) (actual time=0.026..0.027 rows=1 loops=1)
Index Cond: (value = 'ABC1234'::text)
-> Index Scan using test1_key on test1 (cost=0.00..8.38 rows=1 width=15) (actual time=0.020..0.020 rows=0 loops=1)
Index Cond: (public.test1.key = public.test2.key)
Filter: (public.test1.value = 'ABC1234'::text)
-> Index Scan using test1_key on test1 (cost=0.00..8.38 rows=1 width=15) (actual time=0.011..0.013 rows=1 loops=1)
Index Cond: ((public.test1.key IS NOT NULL) AND (public.test1.key = public.test2.key))
-> Index Scan using test2_key on test2 (cost=0.00..8.36 rows=1 width=17) (actual time=0.001..0.001 rows=0 loops=1)
Index Cond: (public.test2.key = public.test1.key)
Total runtime: 0.139 ms
以下查询比较简单,但仍然太慢:
EXPLAIN ANALYZE
SELECT test1.key AS key1, test1.value AS value1,
test2.key AS key2, test2.value AS value2
FROM test1
LEFT OUTER JOIN test2 ON (test1.key = test2.key)
WHERE test1.value = 'ABC1234'
OR EXISTS (SELECT 1 FROM test2 t WHERE t.key = test1.key AND t.value = 'ABC1234');
key1 | value1 | key2 | value2
------+---------+------+---------
1234 | ABC1234 | 1234 | ABC2234
234 | ABC234 | 234 | ABC1234
(2 rows)
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
Merge Left Join (cost=0.00..8446826.32 rows=500001 width=32) (actual time=615.706..1651.370 rows=2 loops=1)
Merge Cond: (test1.key = test2.key)
-> Index Scan using test1_key on test1 (cost=0.00..8398983.25 rows=500001 width=15) (actual time=28.449..734.567 rows=2 loops=1)
Filter: ((value = 'ABC1234'::text) OR (alternatives: SubPlan 1 or hashed SubPlan 2))
SubPlan 1
-> Index Scan using test2_key on test2 t (cost=0.00..8.36 rows=1 width=0) (never executed)
Index Cond: (key = $0)
Filter: (value = 'ABC1234'::text)
SubPlan 2
-> Index Scan using test2_value on test2 t (cost=0.00..8.37 rows=1 width=7) (actual time=0.376..0.380 rows=1 loops=1)
Index Cond: (value = 'ABC1234'::text)
-> Index Scan using test2_key on test2 (cost=0.00..39593.05 rows=800002 width=17) (actual time=0.019..498.456 rows=348894 loops=1)
Total runtime: 1651.453 ms
(13 rows)
所以我的问题是:是否有一个简单的查询会导致类似第二个查询的快速执行计划,或者为规划者提供索引或某种提示。
(对于该示例,我知道只包含一个带有两个值的表是合理的。但是实际上,表更加复杂,并且表方案不能轻易更改。)
PostgreSQL Version: 9.0.3
shared_buffers = 64MB
effective_cache_size = 32MB
work_mem = 16MB
maintenance_work_mem = 32MB
temp_buffers = 8MB
wal_buffers= 1MB
编辑:从基波洛夫建议,这里是UNION版本。为什么普通的OR查询不会选择如此好的计划?
EXPLAIN ANALYZE
SELECT test1.key AS key1, test1.value AS value1,
test2.key AS key2, test2.value AS value2
FROM test1
LEFT OUTER JOIN test2 ON (test1.key = test2.key)
WHERE test1.value = 'ABC1234'
UNION
SELECT test1.key AS key1, test1.value AS value1,
test2.key AS key2, test2.value AS value2
FROM test1
LEFT OUTER JOIN test2 ON (test1.key = test2.key)
WHERE test2.value = 'ABC1234';
key1 | value1 | key2 | value2
------+---------+------+---------
1234 | ABC1234 | 1234 | ABC2234
234 | ABC234 | 234 | ABC1234
(2 rows)
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
Unique (cost=33.64..33.66 rows=2 width=32) (actual time=0.114..0.119 rows=2 loops=1)
-> Sort (cost=33.64..33.64 rows=2 width=32) (actual time=0.111..0.113 rows=2 loops=1)
Sort Key: public.test1.key, public.test1.value, public.test2.key, public.test2.value
Sort Method: quicksort Memory: 17kB
-> Append (cost=0.00..33.63 rows=2 width=32) (actual time=0.046..0.097 rows=2 loops=1)
-> Nested Loop Left Join (cost=0.00..16.81 rows=1 width=32) (actual time=0.044..0.050 rows=1 loops=1)
-> Index Scan using test1_value_key on test1 (cost=0.00..8.44 rows=1 width=15) (actual time=0.023..0.024 rows=1 loops=1)
Index Cond: (value = 'ABC1234'::text)
-> Index Scan using test2_key on test2 (cost=0.00..8.36 rows=1 width=17) (actual time=0.014..0.016 rows=1 loops=1)
Index Cond: (public.test1.key = public.test2.key)
-> Nested Loop (cost=0.00..16.80 rows=1 width=32) (actual time=0.036..0.041 rows=1 loops=1)
-> Index Scan using test2_value_key on test2 (cost=0.00..8.41 rows=1 width=17) (actual time=0.019..0.020 rows=1 loops=1)
Index Cond: (value = 'ABC1234'::text)
-> Index Scan using test1_key on test1 (cost=0.00..8.38 rows=1 width=15) (actual time=0.013..0.015 rows=1 loops=1)
Index Cond: (public.test1.key = public.test2.key)
Total runtime: 0.173 ms
(16 rows)
最佳答案
首先,感谢非常详细的问题。很少有人问之前已经对问题进行了如此详细的研究。
我一直在考虑这个问题,问题似乎在于PostgreSQL要加入所有行,因为test1中每个不匹配的行都可能在test2中匹配,反之亦然。
解决方案是迫使计划者分两步执行查询。一种方法是您已经尝试过的大型UNION查询-强制它在单独的查询中考虑每个表达式。
另一种方法是强制计划者首先找到匹配的键,然后执行联接,因此不会出现歧义:
EXPLAIN ANALYZE
SELECT test1.key AS key1, test1.value AS value1,
test2.key AS key2, test2.value AS value2
FROM (
SELECT key FROM test1 WHERE value='ABC1234'
UNION SELECT key FROM test2 WHERE value='ABC1234'
) AS matching_keys
INNER JOIN test1 USING (key)
LEFT OUTER JOIN test2 USING (key);
Nested Loop Left Join (cost=16.84..34.44 rows=2 width=32) (actual time=0.211..0.280 rows=2 loops=1)
-> Nested Loop (cost=16.84..33.65 rows=2 width=15) (actual time=0.175..0.212 rows=2 loops=1)
-> Unique (cost=16.84..16.85 rows=2 width=6) (actual time=0.132..0.136 rows=2 loops=1)
-> Sort (cost=16.84..16.85 rows=2 width=6) (actual time=0.131..0.132 rows=2 loops=1)
Sort Key: public.test1.key
Sort Method: quicksort Memory: 25kB
-> Append (cost=0.00..16.83 rows=2 width=6) (actual time=0.058..0.110 rows=2 loops=1)
-> Index Scan using test1_value on test1 (cost=0.00..8.42 rows=1 width=6) (actual time=0.056..0.058 rows=1 loops=1)
Index Cond: (value = 'ABC1234'::text)
-> Index Scan using test2_value on test2 (cost=0.00..8.39 rows=1 width=7) (actual time=0.046..0.047 rows=1 loops=1)
Index Cond: (value = 'ABC1234'::text)
-> Index Scan using test1_key on test1 (cost=0.00..8.38 rows=1 width=15) (actual time=0.032..0.033 rows=1 loops=2)
Index Cond: (key = public.test1.key)
-> Index Scan using test2_key on test2 (cost=0.00..0.38 rows=1 width=17) (actual time=0.028..0.029 rows=1 loops=2)
Index Cond: (public.test1.key = key)
Total runtime: 0.390 ms
(16 rows)
同样,UNION充当OR的角色。不幸的是,这种方法对于诸如
value>'ABC1234'
的查询仍然表现不佳。您可以通过提高work_mem
进行一些改进。我在这里茫然。至于最后一个问题:
因为PostgreSQL计划者当前缺乏将OR'ed表达式拆分为单独的UNION查询的能力。有一些注意事项使它变得比看起来更难。
PostgreSQL规划器已经非常复杂,但是到目前为止,利用手动重写SQL已经可以实现的优化并不是一个重要的优先事项。
关于sql - 在WHERE语句中使用OR进行慢速JOIN查询,我们在Stack Overflow上找到一个类似的问题:https://stackoverflow.com/questions/5353363/