如何过滤具有多次通过关系的 SQL 结果

本文介绍了如何过滤具有多次通过关系的 SQL 结果的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

假设我有 student、club 和 student_club 表:

Assuming I have the tables student, club, and student_club:

student {
    id
    name
}
club {
    id
    name
}
student_club {
    student_id
    club_id
}

我想知道如何找到足球 (30) 和棒球 (50) 俱乐部的所有学生.
虽然此查询不起作用，但它是我迄今为止最接近的:

I want to know how to find all students in both the soccer (30) and baseball (50) club.
While this query doesn't work, it's the closest thing I have so far:

SELECT student.*
FROM   student
INNER  JOIN student_club sc ON student.id = sc.student_id
LEFT   JOIN club c ON c.id = sc.club_id
WHERE  c.id = 30 AND c.id = 50

推荐答案

我很好奇.众所周知，好奇心以杀死猫而闻名.

I was curious. And as we all know, curiosity has a reputation for killing cats.

本次测试的猫皮环境:

PostgreSQL 9.0 在 Debian Squeeze 上运行，具有不错的 RAM 和设置.
6,000 名学生，24,000 名俱乐部会员(从具有现实生活数据的类似数据库中复制的数据.)
稍微偏离了问题中的命名模式:student.id 是 student.stud_id 而 club.id 是 club.club_id 在这里.
我在此线程中以其作者的名字命名了这些查询.
我运行了几次所有查询以填充缓存，然后我使用 EXPLAIN ANALYZE 选择了 5 个中最好的.
相关索引(应该是最佳的——只要我们不知道哪些俱乐部会被查询):

PostgreSQL 9.0 on Debian Squeeze with decent RAM and settings.
6.000 students, 24.000 club memberships (data copied from a similar database with real life data.)
Slight diversion from the naming schema in the question: student.id is student.stud_id and club.id is club.club_id here.
I named the queries after their author in this thread.
I ran all queries a couple of times to populate the cache, then I picked the best of 5 with EXPLAIN ANALYZE.
Relevant indexes (should be the optimum - as long as we lack fore-knowledge which clubs will be queried):

ALTER TABLE student ADD CONSTRAINT student_pkey PRIMARY KEY(stud_id );
ALTER TABLE student_club ADD CONSTRAINT sc_pkey PRIMARY KEY(stud_id, club_id);
ALTER TABLE club       ADD CONSTRAINT club_pkey PRIMARY KEY(club_id );
CREATE INDEX sc_club_id_idx ON student_club (club_id);

此处的大多数查询不需要

club_pkey.
主键在 PostgreSQL 中自动实现唯一索引.
最后一个索引是为了弥补多列索引在 PostgreSQL 上:

club_pkey is not required by most queries here.
Primary keys implement unique indexes automatically In PostgreSQL.
The last index is to make up for this known shortcoming of multi-column indexes on PostgreSQL:

多列 B 树索引可以与查询条件一起使用涉及索引列的任何子集，但索引是最多的当对前导(最左侧)列有约束时效率高.

结果

来自 EXPLAIN ANALYZE 的总运行时间.

SELECT s.stud_id, s.name
FROM   student s
JOIN   student_club sc USING (stud_id)
WHERE  sc.club_id IN (30, 50)
GROUP  BY 1,2
HAVING COUNT(*) > 1;

2) 埃尔文 1:33.217 毫秒

SELECT s.stud_id, s.name
FROM   student s
JOIN   (
   SELECT stud_id
   FROM   student_club
   WHERE  club_id IN (30, 50)
   GROUP  BY 1
   HAVING COUNT(*) > 1
   ) sc USING (stud_id);

3) 马丁 1:31.735 毫秒

SELECT s.stud_id, s.name
FROM   student s
WHERE  student_id IN (
   SELECT student_id
   FROM   student_club
   WHERE  club_id = 30

   INTERSECT
   SELECT stud_id
   FROM   student_club
   WHERE  club_id = 50
   );

4) 德里克:2.287 毫秒

SELECT s.stud_id,  s.name
FROM   student s
WHERE  s.stud_id IN (SELECT stud_id FROM student_club WHERE club_id = 30)
AND    s.stud_id IN (SELECT stud_id FROM student_club WHERE club_id = 50);

5) 埃尔文 2:2.181 毫秒

SELECT s.stud_id,  s.name
FROM   student s
WHERE  EXISTS (SELECT * FROM student_club
               WHERE  stud_id = s.stud_id AND club_id = 30)
AND    EXISTS (SELECT * FROM student_club
               WHERE  stud_id = s.stud_id AND club_id = 50);

6) 肖恩:2.043 毫秒

SELECT s.stud_id, s.name
FROM   student s
JOIN   student_club x ON s.stud_id = x.stud_id
JOIN   student_club y ON s.stud_id = y.stud_id
WHERE  x.club_id = 30
AND    y.club_id = 50;

最后三个表现几乎相同.4) 和 5) 产生相同的查询计划.

The last three perform pretty much the same. 4) and 5) result in the same query plan.

花哨的 SQL，但性能跟不上:

Fancy SQL, but the performance can't keep up:

SELECT s.stud_id,  s.name
FROM   student AS s
WHERE  NOT EXISTS (
   SELECT *
   FROM   club AS c
   WHERE  c.club_id IN (30, 50)
   AND    NOT EXISTS (
      SELECT *
      FROM   student_club AS sc
      WHERE  sc.stud_id = s.stud_id
      AND    sc.club_id = c.club_id
      )
   );

8) 超立方体 2:147.497 毫秒

SELECT s.stud_id,  s.name
FROM   student AS s
WHERE  NOT EXISTS (
   SELECT *
   FROM  (
      SELECT 30 AS club_id
      UNION  ALL
      SELECT 50
      ) AS c
   WHERE NOT EXISTS (
      SELECT *
      FROM   student_club AS sc
      WHERE  sc.stud_id = s.stud_id
      AND    sc.club_id = c.club_id
      )
   );

不出所料，这两者的表现几乎相同.查询计划导致表扫描，计划器在此处找不到使用索引的方法.

As expected, those two perform almost the same. Query plan results in table scans, the planner doesn't find a way to use the indexes here.

WITH RECURSIVE two AS (
   SELECT 1::int AS level
        , stud_id
   FROM   student_club sc1
   WHERE  sc1.club_id = 30
   UNION
   SELECT two.level + 1 AS level
        , sc2.stud_id
   FROM   student_club sc2
   JOIN   two USING (stud_id)
   WHERE  sc2.club_id = 50
   AND    two.level = 1
   )
SELECT s.stud_id, s.student
FROM   student s
JOIN   two USING (studid)
WHERE  two.level > 1;

Fancy SQL，CTE 性能不错.非常奇特的查询计划.

Fancy SQL, decent performance for a CTE. Very exotic query plan.

WITH sc AS (
   SELECT stud_id
   FROM   student_club
   WHERE  club_id IN (30,50)
   GROUP  BY stud_id
   HAVING COUNT(*) > 1
   )
SELECT s.*
FROM   student s
JOIN   sc USING (stud_id);

查询 2) 的 CTE 变体.令人惊讶的是，对于完全相同的数据，它可能会导致略有不同的查询计划.我在 student 上发现了一个顺序扫描，其中子查询变体使用了索引.

CTE variant of query 2). Surprisingly, it can result in a slightly different query plan with the exact same data. I found a sequential scan on student, where the subquery-variant used the index.

另一个后期添加的超立方体.真是太神奇了，有多少种方法.

Another late addition ypercube. It is positively amazing, how many ways there are.

SELECT s.stud_id, s.student
FROM   student s
JOIN   student_club sc USING (stud_id)
WHERE  sc.club_id = 10                 -- member in 1st club ...
AND    NOT EXISTS (
   SELECT *
   FROM  (SELECT 14 AS club_id) AS c  -- can't be excluded for missing the 2nd
   WHERE  NOT EXISTS (
      SELECT *
      FROM   student_club AS d
      WHERE  d.stud_id = sc.stud_id
      AND    d.club_id = c.club_id
      )
   );

12) 埃尔文 3:2.377 毫秒

ypercube 的 11) 实际上只是这个更简单变体的令人费解的反向方法，它也仍然缺失.表现几乎和顶级猫一样快.

12) erwin 3: 2.377 ms

ypercube's 11) is actually just the mind-twisting reverse approach of this simpler variant, that was also still missing. Performs almost as fast as the top cats.

SELECT s.*
FROM   student s
JOIN   student_club x USING (stud_id)
WHERE  sc.club_id = 10                 -- member in 1st club ...
AND    EXISTS (                        -- ... and membership in 2nd exists
   SELECT *
   FROM   student_club AS y
   WHERE  y.stud_id = s.stud_id
   AND    y.club_id = 14
   );

13) 埃尔文 4:2.375 毫秒

难以置信，但这是另一个真正的新变体.我看到了超过两个会员资格的潜力，但它也仅拥有两个会员资格就跻身顶级猫之列.

13) erwin 4: 2.375 ms

Hard to believe, but here's another, genuinely new variant. I see potential for more than two memberships, but it also ranks among the top cats with just two.

SELECT s.*
FROM   student AS s
WHERE  EXISTS (
   SELECT *
   FROM   student_club AS x
   JOIN   student_club AS y USING (stud_id)
   WHERE  x.stud_id = s.stud_id
   AND    x.club_id = 14
   AND    y.club_id = 10
   );

俱乐部会员动态数量

换句话说:不同数量的过滤器.这个问题正好要求两个俱乐部会员资格.但是许多用例必须为不同的数量做准备.见:

Dynamic number of club memberships

In other words: varying number of filters. This question asked for exactly two club memberships. But many use cases have to prepare for a varying number. See:

在 WHERE 子句中多次使用同一列

这篇关于如何过滤具有多次通过关系的 SQL 结果的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持！

analyze