问题描述
假设我有 student
、club
和 student_club
表:
Assuming I have the tables student
, club
, and student_club
:
student {
id
name
}
club {
id
name
}
student_club {
student_id
club_id
}
我想知道如何找到足球 (30) 和棒球 (50) 俱乐部的所有学生.
虽然此查询不起作用,但它是我迄今为止最接近的:
I want to know how to find all students in both the soccer (30) and baseball (50) club.
While this query doesn't work, it's the closest thing I have so far:
SELECT student.*
FROM student
INNER JOIN student_club sc ON student.id = sc.student_id
LEFT JOIN club c ON c.id = sc.club_id
WHERE c.id = 30 AND c.id = 50
推荐答案
我很好奇.众所周知,好奇心以杀死猫而闻名.
I was curious. And as we all know, curiosity has a reputation for killing cats.
本次测试的猫皮环境:
- PostgreSQL 9.0 在 Debian Squeeze 上运行,具有不错的 RAM 和设置.
- 6,000 名学生,24,000 名俱乐部会员(从具有现实生活数据的类似数据库中复制的数据.)
- 稍微偏离了问题中的命名模式:
student.id
是student.stud_id
而club.id
是club.club_id
在这里. - 我在此线程中以其作者的名字命名了这些查询.
- 我运行了几次所有查询以填充缓存,然后我使用
EXPLAIN ANALYZE
选择了 5 个中最好的. - 相关索引(应该是最佳的——只要我们不知道哪些俱乐部会被查询):
- PostgreSQL 9.0 on Debian Squeeze with decent RAM and settings.
- 6.000 students, 24.000 club memberships (data copied from a similar database with real life data.)
- Slight diversion from the naming schema in the question:
student.id
isstudent.stud_id
andclub.id
isclub.club_id
here. - I named the queries after their author in this thread.
- I ran all queries a couple of times to populate the cache, then I picked the best of 5 with
EXPLAIN ANALYZE
. - Relevant indexes (should be the optimum - as long as we lack fore-knowledge which clubs will be queried):
ALTER TABLE student ADD CONSTRAINT student_pkey PRIMARY KEY(stud_id );
ALTER TABLE student_club ADD CONSTRAINT sc_pkey PRIMARY KEY(stud_id, club_id);
ALTER TABLE club ADD CONSTRAINT club_pkey PRIMARY KEY(club_id );
CREATE INDEX sc_club_id_idx ON student_club (club_id);
此处的大多数查询不需要
club_pkey
.
主键在 PostgreSQL 中自动实现唯一索引.
最后一个索引是为了弥补多列索引 在 PostgreSQL 上:
club_pkey
is not required by most queries here.
Primary keys implement unique indexes automatically In PostgreSQL.
The last index is to make up for this known shortcoming of multi-column indexes on PostgreSQL:
多列 B 树索引可以与查询条件一起使用涉及索引列的任何子集,但索引是最多的当对前导(最左侧)列有约束时效率高.
结果
来自 EXPLAIN ANALYZE
的总运行时间.
SELECT s.stud_id, s.name
FROM student s
JOIN student_club sc USING (stud_id)
WHERE sc.club_id IN (30, 50)
GROUP BY 1,2
HAVING COUNT(*) > 1;
2) 埃尔文 1:33.217 毫秒
SELECT s.stud_id, s.name
FROM student s
JOIN (
SELECT stud_id
FROM student_club
WHERE club_id IN (30, 50)
GROUP BY 1
HAVING COUNT(*) > 1
) sc USING (stud_id);
3) 马丁 1:31.735 毫秒
SELECT s.stud_id, s.name
FROM student s
WHERE student_id IN (
SELECT student_id
FROM student_club
WHERE club_id = 30
INTERSECT
SELECT stud_id
FROM student_club
WHERE club_id = 50
);
4) 德里克:2.287 毫秒
SELECT s.stud_id, s.name
FROM student s
WHERE s.stud_id IN (SELECT stud_id FROM student_club WHERE club_id = 30)
AND s.stud_id IN (SELECT stud_id FROM student_club WHERE club_id = 50);
5) 埃尔文 2:2.181 毫秒
SELECT s.stud_id, s.name
FROM student s
WHERE EXISTS (SELECT * FROM student_club
WHERE stud_id = s.stud_id AND club_id = 30)
AND EXISTS (SELECT * FROM student_club
WHERE stud_id = s.stud_id AND club_id = 50);
6) 肖恩:2.043 毫秒
SELECT s.stud_id, s.name
FROM student s
JOIN student_club x ON s.stud_id = x.stud_id
JOIN student_club y ON s.stud_id = y.stud_id
WHERE x.club_id = 30
AND y.club_id = 50;
最后三个表现几乎相同.4) 和 5) 产生相同的查询计划.
The last three perform pretty much the same. 4) and 5) result in the same query plan.
花哨的 SQL,但性能跟不上:
Fancy SQL, but the performance can't keep up:
SELECT s.stud_id, s.name
FROM student AS s
WHERE NOT EXISTS (
SELECT *
FROM club AS c
WHERE c.club_id IN (30, 50)
AND NOT EXISTS (
SELECT *
FROM student_club AS sc
WHERE sc.stud_id = s.stud_id
AND sc.club_id = c.club_id
)
);
8) 超立方体 2:147.497 毫秒
SELECT s.stud_id, s.name
FROM student AS s
WHERE NOT EXISTS (
SELECT *
FROM (
SELECT 30 AS club_id
UNION ALL
SELECT 50
) AS c
WHERE NOT EXISTS (
SELECT *
FROM student_club AS sc
WHERE sc.stud_id = s.stud_id
AND sc.club_id = c.club_id
)
);
不出所料,这两者的表现几乎相同.查询计划导致表扫描,计划器在此处找不到使用索引的方法.
As expected, those two perform almost the same. Query plan results in table scans, the planner doesn't find a way to use the indexes here.
WITH RECURSIVE two AS (
SELECT 1::int AS level
, stud_id
FROM student_club sc1
WHERE sc1.club_id = 30
UNION
SELECT two.level + 1 AS level
, sc2.stud_id
FROM student_club sc2
JOIN two USING (stud_id)
WHERE sc2.club_id = 50
AND two.level = 1
)
SELECT s.stud_id, s.student
FROM student s
JOIN two USING (studid)
WHERE two.level > 1;
Fancy SQL,CTE 性能不错.非常奇特的查询计划.
Fancy SQL, decent performance for a CTE. Very exotic query plan.
WITH sc AS (
SELECT stud_id
FROM student_club
WHERE club_id IN (30,50)
GROUP BY stud_id
HAVING COUNT(*) > 1
)
SELECT s.*
FROM student s
JOIN sc USING (stud_id);
查询 2) 的 CTE 变体.令人惊讶的是,对于完全相同的数据,它可能会导致略有不同的查询计划.我在 student
上发现了一个顺序扫描,其中子查询变体使用了索引.
CTE variant of query 2). Surprisingly, it can result in a slightly different query plan with the exact same data. I found a sequential scan on student
, where the subquery-variant used the index.
另一个后期添加的超立方体.真是太神奇了,有多少种方法.
Another late addition ypercube. It is positively amazing, how many ways there are.
SELECT s.stud_id, s.student
FROM student s
JOIN student_club sc USING (stud_id)
WHERE sc.club_id = 10 -- member in 1st club ...
AND NOT EXISTS (
SELECT *
FROM (SELECT 14 AS club_id) AS c -- can't be excluded for missing the 2nd
WHERE NOT EXISTS (
SELECT *
FROM student_club AS d
WHERE d.stud_id = sc.stud_id
AND d.club_id = c.club_id
)
);
12) 埃尔文 3:2.377 毫秒
ypercube 的 11) 实际上只是这个更简单变体的令人费解的反向方法,它也仍然缺失.表现几乎和顶级猫一样快.
12) erwin 3: 2.377 ms
ypercube's 11) is actually just the mind-twisting reverse approach of this simpler variant, that was also still missing. Performs almost as fast as the top cats.
SELECT s.*
FROM student s
JOIN student_club x USING (stud_id)
WHERE sc.club_id = 10 -- member in 1st club ...
AND EXISTS ( -- ... and membership in 2nd exists
SELECT *
FROM student_club AS y
WHERE y.stud_id = s.stud_id
AND y.club_id = 14
);
13) 埃尔文 4:2.375 毫秒
难以置信,但这是另一个真正的新变体.我看到了超过两个会员资格的潜力,但它也仅拥有两个会员资格就跻身顶级猫之列.
13) erwin 4: 2.375 ms
Hard to believe, but here's another, genuinely new variant. I see potential for more than two memberships, but it also ranks among the top cats with just two.
SELECT s.*
FROM student AS s
WHERE EXISTS (
SELECT *
FROM student_club AS x
JOIN student_club AS y USING (stud_id)
WHERE x.stud_id = s.stud_id
AND x.club_id = 14
AND y.club_id = 10
);
俱乐部会员动态数量
换句话说:不同数量的过滤器.这个问题正好要求两个俱乐部会员资格.但是许多用例必须为不同的数量做准备.见:
Dynamic number of club memberships
In other words: varying number of filters. This question asked for exactly two club memberships. But many use cases have to prepare for a varying number. See:
这篇关于如何过滤具有多次通过关系的 SQL 结果的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!