In PostgreSQL, I have two tables with a many-to-many association. The first table contains activities, each of which may have zero or more reasons:

CREATE TABLE activity (
   id integer NOT NULL PRIMARY KEY
   -- other fields removed for readability
);

CREATE TABLE reason (
   id varchar(1) NOT NULL PRIMARY KEY
   -- other fields here
);

To implement the association, a join table exists between those two tables:
CREATE TABLE activity_reason (
   activity_id integer NOT NULL, -- refers to activity.id
   reason_id varchar(1) NOT NULL, -- refers to reason.id
   CONSTRAINT activity_reason_activity FOREIGN KEY (activity_id) REFERENCES activity (id),
   CONSTRAINT activity_reason_reason FOREIGN KEY (reason_id) REFERENCES reason (id)
);

I want to count the combinations of reasons that occur across activities. Suppose I have these records in the activity_reason table:
+--------------+------------+
| activity_id  |  reason_id |
+--------------+------------+
|           1  |          A |
|           1  |          B |
|           2  |          A |
|           2  |          B |
|           3  |          A |
|           4  |          C |
|           4  |          D |
|           4  |          E |
+--------------+------------+

I should get something like this:
+-------+---+------+-------+
| count |   |      |       |
+-------+---+------+-------+
|     2 | A | B    | NULL  |
|     1 | A | NULL | NULL  |
|     1 | C | D    | E     |
+-------+---+------+-------+

Or, eventually, something like:
+-------+-------+
| count |       |
+-------+-------+
|     2 | A,B   |
|     1 | A     |
|     1 | C,D,E |
+-------+-------+

I can't find the SQL query to do this.

Best answer

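To identify equal sets, we need to compare sorted lists of reasons. Two activities share the same set only if their sorted reason arrays are identical.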

SELECT count(*) AS ct, reason_list
FROM  (
   SELECT array_agg(reason_id) AS reason_list
   FROM  (SELECT * FROM activity_reason ORDER BY activity_id, reason_id) ar1
   GROUP  BY activity_id
   ) ar2
GROUP  BY reason_list
ORDER  BY ct DESC, reason_list;

ORDER BY reason_id alone in the innermost subquery would work too, but adding activity_id is typically faster.
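
To try the query above against the sample rows from the question, a scratch table can be filled like this. This is only a sketch: it uses a temporary table without the foreign keys, which is an assumption made for brevity and not part of the original schema.

```sql
-- Scratch copy of the join table (assumption: TEMP table, foreign keys omitted).
-- A temporary table shadows a permanent table of the same name for this session.
CREATE TEMP TABLE activity_reason (
   activity_id integer    NOT NULL,
   reason_id   varchar(1) NOT NULL
);

-- Sample rows taken from the question
INSERT INTO activity_reason (activity_id, reason_id) VALUES
   (1, 'A'), (1, 'B')
 , (2, 'A'), (2, 'B')
 , (3, 'A')
 , (4, 'C'), (4, 'D'), (4, 'E');
```

With this data, the query should return {A,B} with a count of 2, followed by {A} and {C,D,E} with a count of 1 each.
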
We don't strictly need the innermost subquery at all. This works as well:
SELECT count(*) AS ct, reason_list
FROM  (
   SELECT array_agg(reason_id ORDER BY reason_id) AS reason_list
   FROM   activity_reason
   GROUP  BY activity_id
   ) ar2
GROUP  BY reason_list
ORDER  BY ct DESC, reason_list;

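But it's typically slower when processing all or most of the table. Quoting the manual:
"Alternatively, supplying the input values from a sorted subquery will usually work."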
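We could use string_agg() instead of array_agg(), and that would work for your example with varchar(1) (which might be more efficient with the data type "char", btw). It can fail with longer strings, though, because the aggregated value can be ambiguous.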
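
A minimal sketch of the string_agg() variant, assuming a comma as delimiter (the delimiter is my choice, not prescribed by anything above). The two trailing statements illustrate the ambiguity with longer strings: once a value may contain the delimiter, two different sets can produce the same aggregate.

```sql
-- Same grouping logic, but with a comma-separated string per activity
SELECT count(*) AS ct, reason_list
FROM  (
   SELECT string_agg(reason_id, ',' ORDER BY reason_id) AS reason_list
   FROM   activity_reason
   GROUP  BY activity_id
   ) ar2
GROUP  BY reason_list
ORDER  BY ct DESC, reason_list;

-- Ambiguity demo with made-up values: both statements yield 'A,B,C'
SELECT string_agg(r, ',' ORDER BY r) FROM (VALUES ('A,B'), ('C')) t(r);
SELECT string_agg(r, ',' ORDER BY r) FROM (VALUES ('A'), ('B'), ('C')) t(r);
```

With single-character values that can never contain the delimiter, the collision cannot happen, which is why this variant is safe for the varchar(1) example.
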
If reason_id were an integer (like it typically is), there is another, faster solution with sort() from the additional module intarray:
SELECT count(*) AS ct, reason_list
FROM  (
   SELECT sort(array_agg(reason_id)) AS reason_list
   FROM   activity_reason2
   GROUP  BY activity_id
   ) ar2
GROUP  BY reason_list
ORDER  BY ct DESC, reason_list;
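
The query above relies on two things not shown so far: the intarray extension has to be installed in the database, and activity_reason2 needs an integer reason_id. A possible setup, where the table definition and the integer codes are hypothetical stand-ins for the sample data:

```sql
-- sort(int[]) comes from the additional module intarray
-- (installing an extension may require elevated privileges)
CREATE EXTENSION IF NOT EXISTS intarray;

-- Hypothetical integer variant of the join table (TEMP, no foreign keys)
CREATE TEMP TABLE activity_reason2 (
   activity_id integer NOT NULL,
   reason_id   integer NOT NULL
);

-- Sample rows with made-up integer codes: 1 = A, 2 = B, 3 = C, 4 = D, 5 = E
INSERT INTO activity_reason2 (activity_id, reason_id) VALUES
   (1, 1), (1, 2)
 , (2, 1), (2, 2)
 , (3, 1)
 , (4, 3), (4, 4), (4, 5);
```

Here sort() orders each array after aggregation, so no ORDER BY is needed inside array_agg() or in a subquery.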

Related, with more explanation:
Compare arrays for equality, ignoring order of elements
Storing and comparing unique combinations
