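I have a database with the following tables:
books (primary key: bookID)
characterNames (foreign key: books.bookID)
locations (foreign key: books.bookID)
The in-text positions of character names and locations are stored in the corresponding tables.
I am writing a Python script using psycopg2 that finds all occurrences of given character names and locations in books. I only want the occurrences in books where both the character name and the location are found.
I already have a solution for searching for one location and one character: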
WITH b AS (
    -- books that contain both the character and the location
    SELECT bookid
    FROM characternames
    WHERE name = 'XXX'
    GROUP BY 1
    INTERSECT
    SELECT bookid
    FROM locations
    WHERE locname = 'YYY'
    GROUP BY 1
)
SELECT bookid, position, 'char' AS what
FROM b
JOIN characternames USING (bookid)
WHERE name = 'XXX'
UNION ALL
SELECT bookid, position, 'loc' AS what
FROM b
JOIN locations USING (bookid)
WHERE locname = 'YYY'
ORDER BY bookid, position;
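The CTE 'b' contains every bookid in which both the character name 'XXX' and the location 'YYY' appear.
Now I would also like to search for 2 locations and one name (or 2 names and one location, respectively). That is simple if all searched entities must occur in one book, but what about this:
Searching for: Tim, Al, Toolshop
Result: books that include
(Tim, Al, Toolshop) or
(Tim, Al) or
(Tim, Toolshop) or
(Al, Toolshop)
The problem could be repeated with 4, 5, 6... conditions.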
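To make the requirement concrete, the rule above can be read as "a book qualifies when at least two of the searched entities occur in it". The following is only a minimal in-memory sketch of that rule, not a database query. The function name `books_with_min_hits` and the sample data are made up for illustration, and it assumes that character names and location names never collide.

```python
def books_with_min_hits(occurrences, wanted, min_hits=2):
    """Return ids of books that contain at least `min_hits` of the wanted entities.

    occurrences: dict mapping book id -> set of entity names found in that book
    wanted: iterable of searched entity names (characters and locations mixed)
    """
    wanted = set(wanted)
    return sorted(
        book_id
        for book_id, found in occurrences.items()
        if len(wanted & found) >= min_hits
    )


# Made-up sample data, not taken from the question.
sample = {
    1: {"Tim", "Al", "Toolshop"},
    2: {"Tim", "Al"},
    3: {"Tim"},
    4: {"Al", "Toolshop", "Jill"},
    5: {"Toolshop"},
}

result = books_with_min_hits(sample, ["Tim", "Al", "Toolshop"])
print(result)  # [1, 2, 4]
```

Books 3 and 5 drop out because each contains only one of the three searched entities. Raising `min_hits` to the number of search terms gives the "all entities must occur" case.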
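I thought about intersecting more subqueries, but that would not work.
Instead, I would UNION the found bookIDs, group them, and select the bookids that occur more than once: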
WITH b AS (
    SELECT bookid, count(bookid) AS occurrences
    FROM
        (SELECT DISTINCT bookid
         FROM characterNames
         WHERE name='XXX'
         UNION
         SELECT DISTINCT bookid
         FROM characterNames
         WHERE name='YYY'
         UNION
         SELECT DISTINCT bookid
         FROM locations
         WHERE locname='ZZZ'
         GROUP BY bookid)
    WHERE occurrences>1)
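I think this works, though I cannot test it at the moment. But is this the best way to do it?
Best answer
The idea of using a count for the general case is sound. The syntax needs a few adjustments, though: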
WITH b AS (
    SELECT bookid
    FROM (
        SELECT DISTINCT bookid
        FROM characterNames
        WHERE name = 'XXX'
        UNION ALL
        SELECT DISTINCT bookid
        FROM characterNames
        WHERE name = 'YYY'
        UNION ALL
        SELECT DISTINCT bookid
        FROM locations
        WHERE locname = 'ZZZ'
    ) x
    GROUP BY bookid
    HAVING count(*) > 1
)
-- return the positions of every searched term, matching the terms used in the CTE
SELECT bookid, position, 'char' AS what
FROM b
JOIN characternames USING (bookid)
WHERE name IN ('XXX', 'YYY')
UNION ALL
SELECT bookid, position, 'loc' AS what
FROM b
JOIN locations USING (bookid)
WHERE locname = 'ZZZ'
ORDER BY bookid, position;
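The counting step of the query above can be tried out without a PostgreSQL server. The sketch below is an assumption-laden stand-in: it uses Python's built-in sqlite3 module with an in-memory database instead of psycopg2, made-up sample rows, and only the part of the query that selects the bookids. It also runs the same statement with UNION in place of UNION ALL to show why the duplicates are needed for counting.

```python
import sqlite3

# In-memory SQLite database as a stand-in for PostgreSQL (assumption for this demo).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE characternames (bookid INTEGER, name TEXT, position INTEGER);
    CREATE TABLE locations (bookid INTEGER, locname TEXT, position INTEGER);
    -- made-up sample rows
    INSERT INTO characternames VALUES
        (1, 'Tim', 10), (1, 'Tim', 42), (1, 'Al', 20),
        (2, 'Tim', 5),
        (3, 'Al', 7);
    INSERT INTO locations VALUES
        (2, 'Toolshop', 30), (3, 'Garage', 9);
""")

# One row per (book, search term) hit, then keep books with more than one hit.
COUNT_QUERY = """
    SELECT bookid
    FROM (
        SELECT DISTINCT bookid FROM characternames WHERE name = ?
        UNION ALL
        SELECT DISTINCT bookid FROM characternames WHERE name = ?
        UNION ALL
        SELECT DISTINCT bookid FROM locations WHERE locname = ?
    ) x
    GROUP BY bookid
    HAVING count(*) > 1
    ORDER BY bookid
"""

terms = ("Tim", "Al", "Toolshop")
with_union_all = [row[0] for row in conn.execute(COUNT_QUERY, terms)]

# Plain UNION removes duplicates across the subqueries, so every book counts once.
union_query = COUNT_QUERY.replace("UNION ALL", "UNION")
with_union = [row[0] for row in conn.execute(union_query, terms)]

print(with_union_all)  # [1, 2]
print(with_union)      # []
```

Book 1 contains Tim and Al, book 2 contains Tim and the Toolshop, and book 3 contains only Al, so only books 1 and 2 reach two hits. With plain UNION each bookid survives exactly once and no book passes the HAVING filter.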
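Notes
Use UNION ALL (not UNION) to preserve duplicates between the subqueries. In this case you want them so that you can count them.
The subqueries are supposed to produce distinct values, which works with DISTINCT the way you have it. You may want to try GROUP BY 1 instead and see whether it performs better (I don't expect it to).
GROUP BY has to go outside the subquery. Inside, it would only be applied to the last subquery, and it makes no sense there because you already have DISTINCT bookid.
The check for more than one hit per book has to go into a HAVING clause: HAVING count(*) > 1. You cannot use aggregate values in a WHERE clause.
Combine conditions on one table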
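You cannot simply combine multiple conditions on one table. How would you count the number of findings? But there is a somewhat more sophisticated way. It may or may not improve performance; you will have to test (with EXPLAIN ANALYZE). Both queries require at least two index scans on table characterNames. At least it shortens the syntax.
Consider how I compute the number of hits for characterNames and how I changed to sum(hits) in the outer SELECT:
WITH b AS (
    SELECT bookid
    FROM (
        SELECT bookid
             , max((name='XXX')::int)
             + max((name='YYY')::int) AS hits
        FROM characterNames
        WHERE (name='XXX'
            OR name='YYY')
        GROUP BY bookid
        UNION ALL
        SELECT DISTINCT bookid, 1 AS hits
        FROM locations
        WHERE locname='ZZZ'
    ) x
    GROUP BY bookid
    HAVING sum(hits) > 1
)
...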
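The hit-counting idea in the query above can also be checked on toy data. This is again only a sketch under assumptions: it uses Python's built-in sqlite3 module as a stand-in for PostgreSQL, made-up sample rows, and the standard CAST(... AS INTEGER) spelling instead of the PostgreSQL-specific ::int, because SQLite does not understand the :: operator.

```python
import sqlite3

# In-memory SQLite database as a stand-in for PostgreSQL (assumption for this demo).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE characternames (bookid INTEGER, name TEXT, position INTEGER);
    CREATE TABLE locations (bookid INTEGER, locname TEXT, position INTEGER);
    -- made-up sample rows
    INSERT INTO characternames VALUES
        (1, 'Tim', 10), (1, 'Tim', 42), (1, 'Al', 20),
        (2, 'Tim', 5),
        (3, 'Al', 7);
    INSERT INTO locations VALUES
        (2, 'Toolshop', 30), (3, 'Garage', 9);
""")

# characternames is read once; each max(...) is 1 if that name occurs in the book.
HITS_QUERY = """
    SELECT bookid
    FROM (
        SELECT bookid,
               max(CAST(name = ? AS INTEGER))
             + max(CAST(name = ? AS INTEGER)) AS hits
        FROM characternames
        WHERE name = ? OR name = ?
        GROUP BY bookid
        UNION ALL
        SELECT DISTINCT bookid, 1 AS hits
        FROM locations
        WHERE locname = ?
    ) x
    GROUP BY bookid
    HAVING sum(hits) > 1
    ORDER BY bookid
"""

# Each name is bound twice: once for the hit count, once for the WHERE filter.
params = ("Tim", "Al", "Tim", "Al", "Toolshop")
book_ids = [row[0] for row in conn.execute(HITS_QUERY, params)]
print(book_ids)  # [1, 2]
```

Book 1 gets two hits from the character table alone, book 2 gets one hit from each table, and book 3 stays at one hit.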
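Casting a boolean to integer gives 0 for FALSE and 1 for TRUE. That helps.
Faster with EXISTS
While I was riding my bike to work, this kept nagging at the back of my mind. I have reason to believe this query might be even faster. Please give it a try: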
WITH b AS (
    SELECT *
    FROM (
        SELECT bookid
             , (EXISTS (
                   SELECT *
                   FROM characterNames c
                   WHERE c.bookid = b.bookid
                   AND c.name = 'XXX'))::int
             + (EXISTS (
                   SELECT *
                   FROM characterNames c
                   WHERE c.bookid = b.bookid
                   AND c.name = 'YYY'))::int AS c_hits
             , (EXISTS (
                   SELECT *
                   FROM locations l
                   WHERE l.bookid = b.bookid
                   AND l.locname = 'ZZZ'))::int AS l_hits
        FROM books b
    ) x
    -- output column aliases cannot be referenced in WHERE at the same query level,
    -- so the filter sits one level above the SELECT that defines them
    WHERE c_hits + l_hits > 1
)
SELECT c.bookid, c.position, 'char' AS what
FROM b
JOIN characternames c USING (bookid)
WHERE b.c_hits > 0
AND c.name IN ('XXX', 'YYY')
UNION ALL
SELECT l.bookid, l.position, 'loc' AS what
FROM b
JOIN locations l USING (bookid)
WHERE b.l_hits > 0
AND l.locname = 'ZZZ'
ORDER BY 1,2,3;
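Since the question mentions 4, 5, 6... conditions and a psycopg2 script, the EXISTS pattern above could be generated for any number of search terms. The helper below is a hypothetical sketch, not part of psycopg2: the name `build_min_hits_query` and its parameters are invented here. It only builds the book-selecting part (one combined hit count rather than separate c_hits and l_hits), and it emits "%s" placeholders so the values can be passed separately to `cursor.execute` instead of being pasted into the SQL string.

```python
def build_min_hits_query(names, locations, min_hits=2, placeholder="%s"):
    """Build (sql, params) selecting bookids that match at least `min_hits` terms.

    Hypothetical helper for illustration. `placeholder` defaults to the "%s"
    parameter style used by psycopg2.
    """
    if not names and not locations:
        raise ValueError("need at least one search term")

    # One EXISTS term per searched character name, each worth 0 or 1.
    name_terms = [
        "CAST(EXISTS (SELECT 1 FROM characternames c"
        f" WHERE c.bookid = b.bookid AND c.name = {placeholder}) AS integer)"
        for _ in names
    ]
    # One EXISTS term per searched location.
    loc_terms = [
        "CAST(EXISTS (SELECT 1 FROM locations l"
        f" WHERE l.bookid = b.bookid AND l.locname = {placeholder}) AS integer)"
        for _ in locations
    ]
    hits = " + ".join(name_terms + loc_terms)

    # The hit count is computed in a subquery so the outer WHERE can filter on it.
    sql = (
        "SELECT bookid FROM ("
        f"SELECT b.bookid, {hits} AS hits FROM books b"
        f") x WHERE hits >= {placeholder} ORDER BY bookid"
    )
    # Parameters follow the order in which the placeholders appear in the SQL.
    params = [*names, *locations, min_hits]
    return sql, params


sql, params = build_min_hits_query(["Tim", "Al"], ["Toolshop"])
print(sql)
print(params)
```

With an open psycopg2 cursor `cur`, the result would be run as `cur.execute(sql, params)`. Whether this is faster than the counting variants for a given data set is something to measure with EXPLAIN ANALYZE.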
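The EXISTS semi-join can stop executing at the first match. Since we are only interested in an all-or-nothing answer in the CTE, this may well get the job done faster. It also means no aggregation is needed (no GROUP BY).
Regarding "python - efficient query to merge more than 2 subqueries", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/10273604/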