我有一个参数表,有大约30万条记录。
是否可以优化此查询?

SELECT parameter_id AS id,
       COUNT(product_id) AS COUNT
FROM "parameters_products"
WHERE product_id IN
    (SELECT product_id
     FROM parameters_products
     WHERE parameter_id IN ('2'))
GROUP BY parameter_id

查询输出:
2;274669

解释分析详细。。。输出:
HashAggregate  (cost=23628.54..23628.56 rows=2 width=8) (actual time=2231.367..2231.368 rows=1 loops=1)
  Output: parameters_products.parameter_id, count(parameters_products.product_id)
  Group Key: parameters_products.parameter_id
  ->  Hash Semi Join  (cost=9607.86..22256.43 rows=274421 width=8) (actual time=692.586..1893.261 rows=274669 loops=1)
        Output: parameters_products.parameter_id, parameters_products.product_id
        Hash Cond: (parameters_products.product_id = parameters_products_1.product_id)
        ->  Seq Scan on public.parameters_products  (cost=0.00..4356.28 rows=299728 width=8) (actual time=0.025..353.358 rows=299728 loops=1)
              Output: parameters_products.parameter_id, parameters_products.product_id
        ->  Hash  (cost=5105.60..5105.60 rows=274421 width=4) (actual time=692.331..692.331 rows=274669 loops=1)
              Output: parameters_products_1.product_id
              Buckets: 16384  Batches: 4  Memory Usage: 2425kB
              ->  Seq Scan on public.parameters_products parameters_products_1  (cost=0.00..5105.60 rows=274421 width=4) (actual time=0.013..344.656 rows=274669 loops=1)
                    Output: parameters_products_1.product_id
                    Filter: (parameters_products_1.parameter_id = 2)
                    Rows Removed by Filter: 25059
Planning time: 0.279 ms
Execution time: 2231.499 ms

PostgreSQL 9.4.1和真空已启用。
刚试过这个任务,但也很慢:
SELECT pp1.parameter_id,
       count(pp1.product_id)
FROM parameters_products pp1
LEFT JOIN parameters_products pp2 ON pp1.product_id = pp2.product_id
WHERE pp2.parameter_id IN (2)
GROUP BY pp1.parameter_id

--
HashAggregate  (cost=23742.42..23742.44 rows=2 width=8) (actual time=2361.654..2361.654 rows=1 loops=1)
  Output: pp1.parameter_id, count(pp1.product_id)
  Group Key: pp1.parameter_id
  ->  Hash Join  (cost=9607.86..22370.31 rows=274421 width=8) (actual time=715.409..2012.345 rows=274669 loops=1)
        Output: pp1.parameter_id, pp1.product_id
        Hash Cond: (pp1.product_id = pp2.product_id)
        ->  Seq Scan on public.parameters_products pp1  (cost=0.00..4356.28 rows=299728 width=8) (actual time=0.012..360.789 rows=299728 loops=1)
              Output: pp1.parameter_id, pp1.product_id
        ->  Hash  (cost=5105.60..5105.60 rows=274421 width=4) (actual time=715.176..715.176 rows=274669 loops=1)
              Output: pp2.product_id
              Buckets: 16384  Batches: 4  Memory Usage: 2425kB
              ->  Seq Scan on public.parameters_products pp2  (cost=0.00..5105.60 rows=274421 width=4) (actual time=0.009..353.386 rows=274669 loops=1)
                    Output: pp2.product_id
                    Filter: (pp2.parameter_id = 2)
                    Rows Removed by Filter: 25059
Planning time: 0.135 ms
Execution time: 2361.735 ms

索引:
CREATE INDEX parameters_products_parameter_id_idx
  ON parameters_products
  USING btree
  (parameter_id);

CREATE INDEX parameters_products_product_id_idx
  ON parameters_products
  USING btree
  (product_id);

CREATE INDEX parameters_products_product_id_parameter_id_idx
  ON parameters_products
  USING btree
  (product_id, parameter_id);

EXPLAIN ANALYZE VERBOSE
SELECT pp1.parameter_id
FROM parameters_products pp1
LEFT JOIN parameters_products pp2 ON pp1.product_id = pp2.product_id

-
Hash Left Join  (cost=9241.88..22699.06 rows=299728 width=4) (actual time=727.683..2080.798 rows=299728 loops=1)
  Output: pp1.parameter_id
  Hash Cond: (pp1.product_id = pp2.product_id)
  ->  Seq Scan on public.parameters_products pp1  (cost=0.00..4324.28 rows=299728 width=8) (actual time=0.031..355.656 rows=299728 loops=1)
        Output: pp1.parameter_id, pp1.product_id
  ->  Hash  (cost=4324.28..4324.28 rows=299728 width=4) (actual time=727.579..727.579 rows=299728 loops=1)
        Output: pp2.product_id
        Buckets: 16384  Batches: 4  Memory Usage: 2644kB
        ->  Seq Scan on public.parameters_products pp2  (cost=0.00..4324.28 rows=299728 width=4) (actual time=0.008..350.797 rows=299728 loops=1)
              Output: pp2.product_id
Planning time: 0.472 ms
Execution time: 2392.582 ms

SET enable_seqscan = OFF;

缩短了执行时间,但不明显。

最佳答案

我要做的第一件事是用IN替换EXISTS

SELECT parameter_id AS id,
       COUNT(product_id) AS COUNT
FROM parameters_products pp
WHERE EXISTS (SELECT 1
              FROM parameters_products pp2
              WHERE pp2.product_id = pp.product_id AND
                    pp2.parameter_id = 2
             )
GROUP BY parameter_id;

而且,请确保您在parameters_products(product_id, parameter_id)上有一个索引。
另一个想法是使用窗口函数:
select parameter_id, count(*)
from (select pp.*,
             sum(case when pp.parameter_id = 2 then 1 else 0 end) over (partition by product_id) as cnt2
      from parameters_products pp
     ) pp
where cnt2 > 0
group by parameter_id;

关于sql - 如何优化PostgreSQL COUNT GROUP BY查询?,我们在Stack Overflow上找到一个类似的问题:https://stackoverflow.com/questions/33200158/

10-12 12:53
查看更多