因为mysql没有内置的关系除法运算符,所以程序员必须实现自己的关系除法运算符。有两个主要的实现示例,可以在this answer here.
为了子孙后代,我将把它们列在下面:
分组使用

SELECT t.documentid
FROM TABLE t
WHERE t.termid IN (1,2,3)
GROUP BY t.documentid
HAVING COUNT(DISINCT t.termid) = 3

需要注意的是,您必须使用having count(不同于
对于同一个documentid,termid的副本为2将是错误的
肯定的。计数必须等于
in子句。
使用联接
SELECT t.documentid
FROM TABLE t
JOIN TABLE x ON x.termid = t.termid
              AND x.termid = 1
JOIN TABLE y ON y.termid = t.termid
              AND y.termid = 2
JOIN TABLE z ON z.termid = t.termid
              AND z.termid = 3

但这一次对于处理变化很大的标准来说可能是一件痛苦的事。
在这两种实现技术中,哪一种能提供最好的性能?

最佳答案

我在JOIN版本中做了一些改进;见下文。
我投票赞成加速的联合方法。我是这样决定的:
有,版本1

mysql> FLUSH STATUS;
mysql> SELECT city
    ->     FROM us_vch200
    ->     WHERE state IN ('IL', 'MO', 'PA')
    ->     GROUP BY city
    ->     HAVING count(DISTINCT state) >= 3;
+-------------+
| city        |
+-------------+
| Springfield |
| Washington  |
+-------------+
mysql> SHOW SESSION STATUS LIKE 'Handler%';
+----------------------------+-------+
| Variable_name              | Value |
+----------------------------+-------+
| Handler_external_lock      | 2     |
| Handler_read_first         | 1     |
| Handler_read_key           | 2     |
| Handler_read_last          | 1     |
| Handler_read_next          | 4175  | -- full index scan

(etc)

+----+-------------+-----------+-------+-----------------------+------------+---------+------+------+--------------------------------------------------+
| id | select_type | table     | type  | possible_keys         | key        | key_len | ref  | rows | Extra                                            |
+----+-------------+-----------+-------+-----------------------+------------+---------+------+------+--------------------------------------------------+
|  1 | SIMPLE      | us_vch200 | range | state_city,city_state | city_state | 769     | NULL | 4176 | Using where; Using index for group-by (scanning) |
+----+-------------+-----------+-------+-----------------------+------------+---------+------+------+--------------------------------------------------+

“额外”指出,它决定处理GROUP BY并使用INDEX(city, state),尽管INDEX(state, city)可能有意义。
有,版本2
切换到INDEX(state, city)会产生:
mysql> FLUSH STATUS;
mysql> SELECT city
    ->     FROM us_vch200  IGNORE INDEX(city_state)
    ->     WHERE state IN ('IL', 'MO', 'PA')
    ->     GROUP BY city
    ->     HAVING count(DISTINCT state) >= 3;
+-------------+
| city        |
+-------------+
| Springfield |
| Washington  |
+-------------+
mysql> SHOW SESSION STATUS LIKE 'Handler%';
+----------------------------+-------+
| Variable_name              | Value |
+----------------------------+-------+
| Handler_commit             | 1     |
| Handler_external_lock      | 2     |
| Handler_read_key           | 401   |
| Handler_read_next          | 398   |
| Handler_read_rnd           | 398   |
(etc)

+----+-------------+-----------+-------+-----------------------+------------+---------+------+------+------------------------------------------+
| id | select_type | table     | type  | possible_keys         | key        | key_len | ref  | rows | Extra                                    |
+----+-------------+-----------+-------+-----------------------+------------+---------+------+------+------------------------------------------+
|  1 | SIMPLE      | us_vch200 | range | state_city,city_state | state_city | 2       | NULL |  397 | Using where; Using index; Using filesort |
+----+-------------+-----------+-------+-----------------------+------------+---------+------+------+------------------------------------------+

加入
mysql> SELECT x.city
    -> FROM us_vch200 x
    -> JOIN us_vch200 y ON y.city= x.city AND y.state = 'MO'
    -> JOIN us_vch200 z ON z.city= x.city AND z.state = 'PA'
    -> WHERE                                  x.state = 'IL';
+-------------+
| city        |
+-------------+
| Springfield |
| Washington  |
+-------------+
2 rows in set (0.00 sec)

mysql> SHOW SESSION STATUS LIKE 'Handler%';
+----------------------------+-------+
| Variable_name              | Value |
+----------------------------+-------+
| Handler_commit             | 1     |
| Handler_external_lock      | 6     |
| Handler_read_key           | 86    |
| Handler_read_next          | 87    |
(etc)
+----+-------------+-------+------+-----------------------+------------+---------+--------------------+------+--------------------------+
| id | select_type | table | type | possible_keys         | key        | key_len | ref                | rows | Extra                    |
+----+-------------+-------+------+-----------------------+------------+---------+--------------------+------+--------------------------+
|  1 | SIMPLE      | y     | ref  | state_city,city_state | state_city | 2       | const              |   81 | Using where; Using index |
|  1 | SIMPLE      | z     | ref  | state_city,city_state | state_city | 769     | const,world.y.city |    1 | Using where; Using index |
|  1 | SIMPLE      | x     | ref  | state_city,city_state | state_city | 769     | const,world.y.city |    1 | Using where; Using index |
+----+-------------+-------+------+-----------------------+------------+---------+--------------------+------+--------------------------+

只需要INDEX(state, city)。对于这个公式,处理数是最小的,所以我推断它是最快的。
注意优化器是如何决定从哪个表开始的,可能是由于
+-------+----------+
| state | COUNT(*) |
+-------+----------+
| IL    |      221 |
| MO    |       81 |  -- smallest
| PA    |       96 |
+-------+----------+

结论
JOIN(没有不必要的t表)可能是最快的。此外,还需要这个复合索引:INDEX(state, city)
要转换回您的用例:
city --> documentid
state --> termid

注意:ymmv因为documentid和termid的值分布可能与我使用的测试用例有很大的不同。

10-06 05:17
查看更多