我有一个参数表,有大约30万条记录。
是否可以优化此查询?
SELECT parameter_id AS id,
COUNT(product_id) AS COUNT
FROM "parameters_products"
WHERE product_id IN
(SELECT product_id
FROM parameters_products
WHERE parameter_id IN ('2'))
GROUP BY parameter_id
查询输出:
2;274669
解释分析详细。。。输出:
HashAggregate (cost=23628.54..23628.56 rows=2 width=8) (actual time=2231.367..2231.368 rows=1 loops=1)
Output: parameters_products.parameter_id, count(parameters_products.product_id)
Group Key: parameters_products.parameter_id
-> Hash Semi Join (cost=9607.86..22256.43 rows=274421 width=8) (actual time=692.586..1893.261 rows=274669 loops=1)
Output: parameters_products.parameter_id, parameters_products.product_id
Hash Cond: (parameters_products.product_id = parameters_products_1.product_id)
-> Seq Scan on public.parameters_products (cost=0.00..4356.28 rows=299728 width=8) (actual time=0.025..353.358 rows=299728 loops=1)
Output: parameters_products.parameter_id, parameters_products.product_id
-> Hash (cost=5105.60..5105.60 rows=274421 width=4) (actual time=692.331..692.331 rows=274669 loops=1)
Output: parameters_products_1.product_id
Buckets: 16384 Batches: 4 Memory Usage: 2425kB
-> Seq Scan on public.parameters_products parameters_products_1 (cost=0.00..5105.60 rows=274421 width=4) (actual time=0.013..344.656 rows=274669 loops=1)
Output: parameters_products_1.product_id
Filter: (parameters_products_1.parameter_id = 2)
Rows Removed by Filter: 25059
Planning time: 0.279 ms
Execution time: 2231.499 ms
PostgreSQL 9.4.1和真空已启用。
刚试过这个任务,但也很慢:
SELECT pp1.parameter_id,
count(pp1.product_id)
FROM parameters_products pp1
LEFT JOIN parameters_products pp2 ON pp1.product_id = pp2.product_id
WHERE pp2.parameter_id IN (2)
GROUP BY pp1.parameter_id
--
HashAggregate (cost=23742.42..23742.44 rows=2 width=8) (actual time=2361.654..2361.654 rows=1 loops=1)
Output: pp1.parameter_id, count(pp1.product_id)
Group Key: pp1.parameter_id
-> Hash Join (cost=9607.86..22370.31 rows=274421 width=8) (actual time=715.409..2012.345 rows=274669 loops=1)
Output: pp1.parameter_id, pp1.product_id
Hash Cond: (pp1.product_id = pp2.product_id)
-> Seq Scan on public.parameters_products pp1 (cost=0.00..4356.28 rows=299728 width=8) (actual time=0.012..360.789 rows=299728 loops=1)
Output: pp1.parameter_id, pp1.product_id
-> Hash (cost=5105.60..5105.60 rows=274421 width=4) (actual time=715.176..715.176 rows=274669 loops=1)
Output: pp2.product_id
Buckets: 16384 Batches: 4 Memory Usage: 2425kB
-> Seq Scan on public.parameters_products pp2 (cost=0.00..5105.60 rows=274421 width=4) (actual time=0.009..353.386 rows=274669 loops=1)
Output: pp2.product_id
Filter: (pp2.parameter_id = 2)
Rows Removed by Filter: 25059
Planning time: 0.135 ms
Execution time: 2361.735 ms
索引:
CREATE INDEX parameters_products_parameter_id_idx
ON parameters_products
USING btree
(parameter_id);
CREATE INDEX parameters_products_product_id_idx
ON parameters_products
USING btree
(product_id);
CREATE INDEX parameters_products_product_id_parameter_id_idx
ON parameters_products
USING btree
(product_id, parameter_id);
EXPLAIN ANALYZE VERBOSE
SELECT pp1.parameter_id
FROM parameters_products pp1
LEFT JOIN parameters_products pp2 ON pp1.product_id = pp2.product_id
-
Hash Left Join (cost=9241.88..22699.06 rows=299728 width=4) (actual time=727.683..2080.798 rows=299728 loops=1)
Output: pp1.parameter_id
Hash Cond: (pp1.product_id = pp2.product_id)
-> Seq Scan on public.parameters_products pp1 (cost=0.00..4324.28 rows=299728 width=8) (actual time=0.031..355.656 rows=299728 loops=1)
Output: pp1.parameter_id, pp1.product_id
-> Hash (cost=4324.28..4324.28 rows=299728 width=4) (actual time=727.579..727.579 rows=299728 loops=1)
Output: pp2.product_id
Buckets: 16384 Batches: 4 Memory Usage: 2644kB
-> Seq Scan on public.parameters_products pp2 (cost=0.00..4324.28 rows=299728 width=4) (actual time=0.008..350.797 rows=299728 loops=1)
Output: pp2.product_id
Planning time: 0.472 ms
Execution time: 2392.582 ms
SET enable_seqscan = OFF;
缩短了执行时间,但不明显。
最佳答案
我要做的第一件事是用IN
替换EXISTS
:
SELECT parameter_id AS id,
COUNT(product_id) AS COUNT
FROM parameters_products pp
WHERE EXISTS (SELECT 1
FROM parameters_products pp2
WHERE pp2.product_id = pp.product_id AND
pp2.parameter_id = 2
)
GROUP BY parameter_id;
而且,请确保您在
parameters_products(product_id, parameter_id)
上有一个索引。另一个想法是使用窗口函数:
select parameter_id, count(*)
from (select pp.*,
sum(case when pp.parameter_id = 2 then 1 else 0 end) over (partition by product_id) as cnt2
from parameters_products pp
) pp
where cnt2 > 0
group by parameter_id;
关于sql - 如何优化PostgreSQL COUNT GROUP BY查询?,我们在Stack Overflow上找到一个类似的问题:https://stackoverflow.com/questions/33200158/