Question
I have two tables in PostgreSQL 12: a dataset has many cfiles, and a cfile belongs to one dataset.
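For clarity, the relevant parts of the schema look roughly like this (a simplified sketch with only the columns that matter here; the real tables have more columns, as the query plans further down show):

CREATE TABLE datasets (
  id   SERIAL PRIMARY KEY,
  name TEXT NOT NULL
);

CREATE TABLE cfiles (
  id              SERIAL PRIMARY KEY,
  dataset_id      INTEGER NOT NULL REFERENCES datasets(id),
  property_values JSONB  -- e.g. {"Sample Names": ["SampA", "SampB"]}
);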
SELECT * FROM datasets;
id | name
----+----------
1 | dataset1
2 | dataset2
SELECT * FROM cfiles;
id | dataset_id | property_values (jsonb)
----+------------+-----------------------------------------------
1 | 1 | {"Sample Names": ["SampA", "SampB", "SampC"]}
2 | 1 | {"Sample Names": ["SampA", "SampB", "SampD"]}
3 | 1 | {"Sample Names": ["SampE"]}
4 | 2 | {"Sample Names": ["SampA, SampF"]}
5 | 2 | {"Sample Names": ["SampG"]}
I am trying to get this result:
id | name | sample_names
----+----------+-----------------------------------
1 | dataset1 | SampA; SampB; SampC; SampD; SampE
2 | dataset2 | SampA, SampF; SampG
Following from this SO question and great answer, I have the query below:
SELECT datasets.id, datasets.name,
string_agg(DISTINCT sn.sample_names, '; ' ORDER BY sn.sample_names) as sample_names
FROM cfiles
CROSS JOIN jsonb_array_elements_text(cfiles.property_values -> 'Sample Names') as sn(sample_names)
JOIN datasets on cfiles.dataset_id=datasets.id
GROUP BY datasets.id, datasets.name
-- Problematic line:
-- ORDER BY datasets.name
LIMIT 20;
This works very well until I need to order the result.
For 45K cfiles rows: ~12 ms without the ORDER BY, ~58881 ms with it.
Below is my original query (from the SO question above). It is far less elegant and relies on string manipulation, but it outperforms the cross join by about 10x, at ~5150 ms:
SELECT datasets.id,
datasets.name,
ARRAY_TO_STRING(
ARRAY(
SELECT DISTINCT * FROM unnest(
STRING_TO_ARRAY(
STRING_AGG(
DISTINCT REPLACE(
REPLACE(
REPLACE(
REPLACE(
cfiles.property_values ->> 'Sample Names', '",' || chr(32) || '"', ';'
), '[' , ''
), '"' , ''
), ']' , ''
), ';'
), ';'
)
) ORDER BY 1 ASC
), '; '
) as sample_names
FROM datasets
JOIN cfiles ON cfiles.dataset_id=datasets.id
GROUP BY datasets.id, datasets.name
ORDER BY datasets.name
LIMIT 20;
Is there any way I can improve the performance of the cross join query above (including the ORDER BY) to make it faster than the string manipulation alternative?
Query Plan for the cross join query without ORDER BY
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.69..4351.18 rows=20 width=106) (actual time=0.409..11.706 rows=20 loops=1)
Output: datasets.id, datasets.name, (string_agg(DISTINCT sn.sample_names, '; '::text ORDER BY sn.sample_names))
-> GroupAggregate (cost=0.69..132907.88 rows=611 width=106) (actual time=0.407..11.694 rows=20 loops=1)
Output: datasets.id, datasets.name, string_agg(DISTINCT sn.sample_names, '; '::text ORDER BY sn.sample_names)
Group Key: datasets.id
-> Nested Loop (cost=0.69..109992.24 rows=4581600 width=106) (actual time=0.065..10.742 rows=207 loops=1)
Output: datasets.id, datasets.name, sn.sample_names
-> Merge Join (cost=0.69..18360.24 rows=45816 width=527) (actual time=0.042..5.155 rows=1697 loops=1)
Output: cfiles.property_values, datasets.id, datasets.name
Inner Unique: true
Merge Cond: (cfiles.dataset_id = datasets.id)
-> Index Scan using index_cfiles_dataset_id_path on public.cfiles (cost=0.41..17682.45 rows=45816 width=461) (actual time=0.016..2.665 rows=1697 loops=1)
Output: cfiles.id, cfiles.tid, cfiles.uuid, cfiles.dataset_id, cfiles.path, cfiles.name, cfiles.checksum, cfiles.size, cfiles.last_modified, cfiles.content_type, cfiles.locked, cfiles.property_values, cfiles.created_at, cfiles.updated_at
-> Index Scan using datasets_pkey on public.datasets (cost=0.28..103.56 rows=611 width=74) (actual time=0.016..0.066 rows=27 loops=1)
Output: datasets.id, datasets.tid, datasets.bucket_path_id, datasets.path, datasets.name, datasets.last_modified, datasets.file_count, datasets.size, datasets.content_types, datasets.locked, datasets.created_at, datasets.updated_at
-> Function Scan on pg_catalog.jsonb_array_elements_text sn (cost=0.01..1.00 rows=100 width=32) (actual time=0.002..0.002 rows=0 loops=1697)
Output: sn.sample_names
Function Call: jsonb_array_elements_text((cfiles.property_values -> 'Sample Names'::text))
Planning Time: 0.926 ms
Execution Time: 11.845 ms
(20 rows)
Query Plan for the cross join query with ORDER BY
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=130727.27..130727.32 rows=20 width=106) (actual time=60970.131..60970.140 rows=20 loops=1)
Output: datasets.id, datasets.name, (string_agg(DISTINCT sn.sample_names, '; '::text ORDER BY sn.sample_names))
-> Sort (cost=130727.27..130728.79 rows=611 width=106) (actual time=60970.128..60970.132 rows=20 loops=1)
Output: datasets.id, datasets.name, (string_agg(DISTINCT sn.sample_names, '; '::text ORDER BY sn.sample_names))
Sort Key: datasets.name
Sort Method: top-N heapsort Memory: 27kB
-> GroupAggregate (cost=10585.66..130711.01 rows=611 width=106) (actual time=112.152..60965.350 rows=598 loops=1)
Output: datasets.id, datasets.name, string_agg(DISTINCT sn.sample_names, '; '::text ORDER BY sn.sample_names)
Group Key: datasets.id
-> Nested Loop (cost=10585.66..107795.37 rows=4581600 width=106) (actual time=111.856..4959.284 rows=3289130 loops=1)
Output: datasets.id, datasets.name, sn.sample_names
-> Gather Merge (cost=10585.66..16163.37 rows=45816 width=527) (actual time=111.828..207.605 rows=45816 loops=1)
Output: cfiles.property_values, datasets.id, datasets.name
Workers Planned: 2
Workers Launched: 2
-> Merge Join (cost=9585.63..9875.04 rows=19090 width=527) (actual time=100.410..132.173 rows=15272 loops=3)
Output: cfiles.property_values, datasets.id, datasets.name
Inner Unique: true
Merge Cond: (cfiles.dataset_id = datasets.id)
Worker 0: actual time=94.756..119.875 rows=12140 loops=1
Worker 1: actual time=95.064..120.437 rows=12454 loops=1
-> Sort (cost=9529.25..9576.97 rows=19090 width=461) (actual time=99.259..114.951 rows=15272 loops=3)
Output: cfiles.property_values, cfiles.dataset_id
Sort Key: cfiles.dataset_id
Sort Method: external merge Disk: 10192kB
Worker 0: Sort Method: external merge Disk: 5568kB
Worker 1: Sort Method: external merge Disk: 5592kB
Worker 0: actual time=93.461..105.574 rows=12140 loops=1
Worker 1: actual time=93.784..105.796 rows=12454 loops=1
-> Parallel Seq Scan on public.cfiles (cost=0.00..4188.90 rows=19090 width=461) (actual time=0.028..21.442 rows=15272 loops=3)
Output: cfiles.property_values, cfiles.dataset_id
Worker 0: actual time=0.036..22.118 rows=12140 loops=1
Worker 1: actual time=0.035..22.162 rows=12454 loops=1
-> Sort (cost=56.38..57.91 rows=611 width=74) (actual time=1.133..1.334 rows=603 loops=3)
Output: datasets.id, datasets.name
Sort Key: datasets.id
Sort Method: quicksort Memory: 110kB
Worker 0: Sort Method: quicksort Memory: 110kB
Worker 1: Sort Method: quicksort Memory: 110kB
Worker 0: actual time=1.272..1.471 rows=611 loops=1
Worker 1: actual time=1.259..1.474 rows=611 loops=1
-> Seq Scan on public.datasets (cost=0.00..28.11 rows=611 width=74) (actual time=0.100..0.584 rows=611 loops=3)
Output: datasets.id, datasets.name
Worker 0: actual time=0.155..0.719 rows=611 loops=1
Worker 1: actual time=0.121..0.667 rows=611 loops=1
-> Function Scan on pg_catalog.jsonb_array_elements_text sn (cost=0.01..1.00 rows=100 width=32) (actual time=0.051..0.067 rows=72 loops=45816)
Output: sn.sample_names
Function Call: jsonb_array_elements_text((cfiles.property_values -> 'Sample Names'::text))
Planning Time: 0.894 ms
Execution Time: 60972.185 ms
(50 rows)
UPDATE 2: @bobflux's query from the answer below got it down to ~9 ms! Query plan:
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Sort (cost=25228.68..25228.73 rows=20 width=72) (actual time=8.166..8.177 rows=14 loops=1)
Output: ds_1.dataset_id, ds.dataset_name, (string_agg(DISTINCT (jsonb_array_elements_text((cfiles.property_values -> 'Sample Names'::text))), '; '::text ORDER BY (jsonb_array_elements_text((cfiles.property_values -> 'Sample Names'::text)))))
Sort Key: ds.dataset_name
Sort Method: quicksort Memory: 26kB
CTE ds
-> Limit (cost=0.69..16.76 rows=20 width=74) (actual time=0.059..0.313 rows=20 loops=1)
Output: datasets.id, datasets.name
-> Nested Loop Semi Join (cost=0.69..488.56 rows=607 width=74) (actual time=0.057..0.302 rows=20 loops=1)
Output: datasets.id, datasets.name
-> Index Only Scan using datasets_name_id on public.datasets (cost=0.28..137.44 rows=611 width=74) (actual time=0.028..0.062 rows=20 loops=1)
Output: datasets.name, datasets.id
Heap Fetches: 20
-> Index Only Scan using index_cfiles_dataset_id_path on public.cfiles cfiles_1 (cost=0.41..5.79 rows=75 width=8) (actual time=0.010..0.010 rows=1 loops=20)
Output: cfiles_1.dataset_id, cfiles_1.path
Index Cond: (cfiles_1.dataset_id = datasets.id)
Heap Fetches: 0
-> Hash Join (cost=24073.53..25211.48 rows=20 width=72) (actual time=7.261..8.025 rows=14 loops=1)
Output: ds_1.dataset_id, ds.dataset_name, (string_agg(DISTINCT (jsonb_array_elements_text((cfiles.property_values -> 'Sample Names'::text))), '; '::text ORDER BY (jsonb_array_elements_text((cfiles.property_values -> 'Sample Names'::text)))))
Hash Cond: (ds_1.dataset_id = ds.dataset_id)
-> GroupAggregate (cost=24072.88..25207.88 rows=200 width=40) (actual time=6.862..7.602 rows=14 loops=1)
Output: ds_1.dataset_id, string_agg(DISTINCT (jsonb_array_elements_text((cfiles.property_values -> 'Sample Names'::text))), '; '::text ORDER BY (jsonb_array_elements_text((cfiles.property_values -> 'Sample Names'::text))))
Group Key: ds_1.dataset_id
-> Sort (cost=24072.88..24450.38 rows=151000 width=40) (actual time=6.688..6.744 rows=259 loops=1)
Output: ds_1.dataset_id, (jsonb_array_elements_text((cfiles.property_values -> 'Sample Names'::text)))
Sort Key: ds_1.dataset_id
Sort Method: quicksort Memory: 44kB
-> ProjectSet (cost=0.41..5443.72 rows=151000 width=40) (actual time=4.419..6.469 rows=259 loops=1)
Output: ds_1.dataset_id, jsonb_array_elements_text((cfiles.property_values -> 'Sample Names'::text))
-> Nested Loop (cost=0.41..4673.62 rows=1510 width=459) (actual time=0.028..4.285 rows=1749 loops=1)
Output: cfiles.property_values, ds_1.dataset_id
-> CTE Scan on ds ds_1 (cost=0.00..0.40 rows=20 width=8) (actual time=0.001..0.012 rows=20 loops=1)
Output: ds_1.dataset_id, ds_1.dataset_name
-> Index Scan using index_cfiles_dataset_id_path on public.cfiles (cost=0.41..232.91 rows=75 width=459) (actual time=0.012..0.129 rows=87 loops=20)
Output: cfiles.id, cfiles.tid, cfiles.uuid, cfiles.dataset_id, cfiles.path, cfiles.name, cfiles.checksum, cfiles.size, cfiles.last_modified, cfiles.content_type, cfiles.locked, cfiles.property_values, cfiles.created_at, cfiles.updated_at
Index Cond: (cfiles.dataset_id = ds_1.dataset_id)
-> Hash (cost=0.40..0.40 rows=20 width=40) (actual time=0.382..0.383 rows=20 loops=1)
Output: ds.dataset_name, ds.dataset_id
Buckets: 1024 Batches: 1 Memory Usage: 10kB
-> CTE Scan on ds (cost=0.00..0.40 rows=20 width=40) (actual time=0.067..0.356 rows=20 loops=1)
Output: ds.dataset_name, ds.dataset_id
Planning Time: 1.781 ms
Execution Time: 8.386 ms
(42 rows)
Answer
Let's create test data on PostgreSQL 13 with 600 datasets and 45K cfiles.
BEGIN;
CREATE TABLE cfiles (
id SERIAL PRIMARY KEY,
dataset_id INTEGER NOT NULL,
property_values jsonb NOT NULL);
INSERT INTO cfiles (dataset_id,property_values)
SELECT 1+(random()*600)::INTEGER AS did,
('{"Sample Names": ["'||array_to_string(array_agg(DISTINCT prop),'","')||'"]}')::jsonb prop
FROM (
SELECT 1+(random()*45000)::INTEGER AS cid,
'Samp'||(power(random(),2)*30)::INTEGER AS prop
FROM generate_series(1,45000*4)) foo
GROUP BY cid;
COMMIT;
CREATE TABLE datasets ( id INTEGER PRIMARY KEY, name TEXT NOT NULL );
INSERT INTO datasets SELECT n, 'dataset'||n FROM (SELECT DISTINCT dataset_id n FROM cfiles) foo;
CREATE INDEX cfiles_dataset ON cfiles(dataset_id);
VACUUM ANALYZE cfiles;
VACUUM ANALYZE datasets;
Your original query is a lot faster here, but that's probably because postgres 13 is just smarter.
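The plan below comes from re-running the cross join query from the question against this test data, roughly the following (the ORDER BY is enabled, the LIMIT is dropped, and my index is simply named cfiles_dataset):

EXPLAIN ANALYZE
SELECT datasets.id, datasets.name,
       string_agg(DISTINCT sn.sample_names, '; ' ORDER BY sn.sample_names) AS sample_names
FROM cfiles
CROSS JOIN jsonb_array_elements_text(cfiles.property_values -> 'Sample Names') AS sn(sample_names)
JOIN datasets ON cfiles.dataset_id = datasets.id
GROUP BY datasets.id, datasets.name
ORDER BY datasets.name;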
Sort (cost=114127.87..114129.37 rows=601 width=46) (actual time=658.943..659.012 rows=601 loops=1)
Sort Key: datasets.name
Sort Method: quicksort Memory: 334kB
-> GroupAggregate (cost=0.57..114100.13 rows=601 width=46) (actual time=13.954..655.916 rows=601 loops=1)
Group Key: datasets.id
-> Nested Loop (cost=0.57..92009.62 rows=4416600 width=46) (actual time=13.373..360.991 rows=163540 loops=1)
-> Merge Join (cost=0.56..3677.61 rows=44166 width=78) (actual time=13.350..113.567 rows=44166 loops=1)
Merge Cond: (cfiles.dataset_id = datasets.id)
-> Index Scan using cfiles_dataset on cfiles (cost=0.29..3078.75 rows=44166 width=68) (actual time=0.015..69.098 rows=44166 loops=1)
-> Index Scan using datasets_pkey on datasets (cost=0.28..45.29 rows=601 width=14) (actual time=0.024..0.580 rows=601 loops=1)
-> Function Scan on jsonb_array_elements_text sn (cost=0.01..1.00 rows=100 width=32) (actual time=0.003..0.004 rows=4 loops=44166)
Execution Time: 661.978 ms
This query reads the big table (cfiles) first, and the aggregation leaves far fewer rows. It is therefore faster to join with datasets after the number of rows to join has been reduced, not before, so let's move that join. I also got rid of the CROSS JOIN, which is unnecessary: when a set-returning function appears in the SELECT list, Postgres expands the rows for you for free.
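(As an aside, here is a minimal sketch, not from the original answer, of why the explicit join is redundant: on PostgreSQL 10+ these two forms should return the same rows.)

-- set-returning function directly in the SELECT list
SELECT dataset_id,
       jsonb_array_elements_text(property_values -> 'Sample Names') AS sn
FROM cfiles;

-- equivalent explicit lateral cross join
SELECT dataset_id, sn.sample_names
FROM cfiles
CROSS JOIN LATERAL jsonb_array_elements_text(property_values -> 'Sample Names') AS sn(sample_names);

With that in mind, the reworked query: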
SELECT dataset_id, d.name, sample_names FROM (
  SELECT dataset_id, string_agg(sn, '; ') AS sample_names FROM (
    SELECT DISTINCT dataset_id,
           jsonb_array_elements_text(cfiles.property_values -> 'Sample Names') AS sn
    FROM cfiles
  ) f GROUP BY dataset_id
) g JOIN datasets d ON (d.id = g.dataset_id)
ORDER BY d.name;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
Sort (cost=536207.44..536207.94 rows=200 width=46) (actual time=264.435..264.502 rows=601 loops=1)
Sort Key: d.name
Sort Method: quicksort Memory: 334kB
-> Hash Join (cost=536188.20..536199.79 rows=200 width=46) (actual time=261.404..261.784 rows=601 loops=1)
Hash Cond: (d.id = cfiles.dataset_id)
-> Seq Scan on datasets d (cost=0.00..10.01 rows=601 width=14) (actual time=0.025..0.124 rows=601 loops=1)
-> Hash (cost=536185.70..536185.70 rows=200 width=36) (actual time=261.361..261.363 rows=601 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 170kB
-> HashAggregate (cost=536181.20..536183.70 rows=200 width=36) (actual time=260.805..261.054 rows=601 loops=1)
Group Key: cfiles.dataset_id
Batches: 1 Memory Usage: 1081kB
-> HashAggregate (cost=409982.82..507586.70 rows=1906300 width=36) (actual time=244.419..253.094 rows=18547 loops=1)
Group Key: cfiles.dataset_id, jsonb_array_elements_text((cfiles.property_values -> 'Sample Names'::text))
Planned Partitions: 4 Batches: 1 Memory Usage: 13329kB
-> ProjectSet (cost=0.00..23530.32 rows=4416600 width=36) (actual time=0.030..159.741 rows=163540 loops=1)
-> Seq Scan on cfiles (cost=0.00..1005.66 rows=44166 width=68) (actual time=0.006..9.588 rows=44166 loops=1)
Planning Time: 0.247 ms
Execution Time: 269.362 ms
That's better. But I see a LIMIT in your query, which means you're probably doing something like pagination. In that case, computing the whole aggregation over the entire cfiles table, only to throw away most of the results because of the LIMIT, is necessary only IF the result of that big query can change whether a row from datasets makes it into the final result. And here it can: rows in datasets that have no corresponding cfiles will not appear in the final result, which means the contents of cfiles influence pagination. Well, we can always cheat: to know whether a row from datasets has to be included, all that is required is that ONE row from cfiles exists with that dataset id...
So, in order to know which rows of datasets will be included in the final result, we can use one of these two queries:
SELECT id FROM datasets WHERE EXISTS( SELECT * FROM cfiles WHERE cfiles.dataset_id = datasets.id )
ORDER BY name LIMIT 20;
SELECT dataset_id FROM
(SELECT id AS dataset_id, name AS dataset_name FROM datasets ORDER BY dataset_name) f1
WHERE EXISTS( SELECT * FROM cfiles WHERE cfiles.dataset_id = f1.dataset_id )
ORDER BY dataset_name
LIMIT 20;
Those take about 2-3 milliseconds. We can also cheat:
CREATE INDEX datasets_name_id ON datasets(name,id);
This brings it down to about 300 microseconds. Now that we have the list of dataset_id values that will actually be used (and not thrown away), we can run the big, slow aggregation only on the rows that will actually appear in the final result, which saves a large amount of unnecessary work...
WITH ds AS (
  SELECT id AS dataset_id, name AS dataset_name
  FROM datasets
  WHERE EXISTS (SELECT * FROM cfiles WHERE cfiles.dataset_id = datasets.id)
  ORDER BY name LIMIT 20
)
SELECT dataset_id, dataset_name, sample_names FROM (
  SELECT dataset_id, string_agg(DISTINCT sn, '; ' ORDER BY sn) AS sample_names FROM (
    SELECT dataset_id,
           jsonb_array_elements_text(cfiles.property_values -> 'Sample Names') AS sn
    FROM ds JOIN cfiles USING (dataset_id)
  ) g GROUP BY dataset_id
) h JOIN ds USING (dataset_id)
ORDER BY dataset_name;
This takes about 30 ms. I also added the ORDER BY on the sample names inside the string_agg, which I had forgotten before. It should work for your case. The important point is that the query time no longer depends on the size of the cfiles table, since only the rows that are actually needed are processed.
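If you paginate beyond the first page, the same trick should extend naturally: the OFFSET/LIMIT only needs to be applied to the cheap EXISTS query in the CTE, never to the aggregation itself. A sketch (the OFFSET value is just an example):

WITH ds AS (
  SELECT id AS dataset_id, name AS dataset_name
  FROM datasets
  WHERE EXISTS (SELECT * FROM cfiles WHERE cfiles.dataset_id = datasets.id)
  ORDER BY name
  LIMIT 20 OFFSET 40  -- e.g. page 3 at 20 rows per page
)
SELECT dataset_id, dataset_name, sample_names FROM (
  SELECT dataset_id, string_agg(DISTINCT sn, '; ' ORDER BY sn) AS sample_names FROM (
    SELECT dataset_id,
           jsonb_array_elements_text(cfiles.property_values -> 'Sample Names') AS sn
    FROM ds JOIN cfiles USING (dataset_id)
  ) g GROUP BY dataset_id
) h JOIN ds USING (dataset_id)
ORDER BY dataset_name;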
Please post the results ;)