I have a database with these tables:

books          (primary key: bookID)
characterNames (foreign key: books.bookID)
locations      (foreign key: books.bookID)

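The text positions of character names and locations are stored in the respective tables.
I am writing a Python script using psycopg2 to find all occurrences of given character names and locations in books. I only want occurrences in books where both the character name and the location appear.
I already have a solution for searching one location and one character: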
WITH b AS (
    SELECT bookid
    FROM   characternames
    WHERE  name = 'XXX'
    GROUP  BY 1
    INTERSECT
    SELECT bookid
    FROM   locations
    WHERE  locname = 'YYY'
    GROUP  BY 1
    )
SELECT bookid, position, 'char' AS what
FROM   b
JOIN   characternames USING (bookid)
WHERE  name = 'XXX'
UNION  ALL
SELECT bookid, position, 'loc' AS what
FROM   b
JOIN   locations USING (bookid)
WHERE  locname = 'YYY'
ORDER  BY bookid, position;

The CTE 'b' contains all bookids in which both character name 'XXX' and location 'YYY' appear.

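Now I would also like to search for 2 locations and one name (or 2 names and one location, respectively). That is simple if all searched entities must occur in one book, but what about this:
Searching for: Tim, Al, Toolshop
Result: books including
(Tim, Al, Toolshop) or
(Tim, Al) or
(Tim, Toolshop) or
(Al, Toolshop)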

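The same problem repeats for 4, 5, 6... conditions.
I thought about intersecting more subqueries, but that would not work.
Instead, I would UNION the bookids found, group them, and select the bookids that occur more than once: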
WITH b AS (
    SELECT bookid, count(bookid) AS occurrences
    FROM
        (SELECT DISTINCT bookid
        FROM characterNames
        WHERE name='XXX'
        UNION
        SELECT DISTINCT bookid
        FROM characterNames
        WHERE name='YYY'
        UNION
        SELECT DISTINCT bookid
        FROM locations
        WHERE locname='ZZZ'
        GROUP BY bookid)
    WHERE occurrences>1)

I think this works, though I cannot test it at the moment. But is it the best way?

Best answer

The idea of using a count for the general case is sound. The syntax needs a few adjustments, though:

WITH b AS (
   SELECT bookid
   FROM  (
      SELECT DISTINCT bookid
      FROM   characterNames
      WHERE  name='XXX'

      UNION ALL
      SELECT DISTINCT bookid
      FROM   characterNames
      WHERE  name='YYY'

      UNION ALL
      SELECT DISTINCT bookid
      FROM   locations
      WHERE  locname='ZZZ'
      ) x
   GROUP  BY bookid
   HAVING count(*) > 1
   )
SELECT bookid, position, 'char' AS what
FROM   b
JOIN   characternames USING (bookid)
WHERE  name IN ('XXX', 'YYY')

UNION  ALL
SELECT bookid, position, 'loc' AS what
FROM   b
JOIN   locations USING (bookid)
WHERE  locname = 'ZZZ'
ORDER  BY bookid, position;

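Notes
  • Use UNION ALL (not UNION) to preserve duplicates between the subqueries. In this case you want the duplicates so that you can count them.
  • Each subquery should produce distinct values. That works with DISTINCT the way you have it. You might want to try GROUP BY 1 instead and see whether it performs better (I don't expect it to).
  • The GROUP BY has to go outside the subquery. Where you have it, it would apply only to the last subquery, and it makes no sense there since you already have DISTINCT bookid.
  • The check for more than one hit per book has to go into a HAVING clause:
     HAVING count(*) > 1

    You cannot use aggregate values in a WHERE clause.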
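The counting approach extends to any number of search terms: one DISTINCT subquery per term, glued together with UNION ALL, then counted per book in the outer query. The following is only a minimal sketch of that idea. It assumes Python's built-in sqlite3 module as a stand-in for PostgreSQL so that it runs without a server; the helper name build_book_filter, its min_hits parameter and the sample rows are hypothetical and not part of the original question.

```python
import sqlite3


def build_book_filter(names, locations, min_hits=2):
    """Build (sql, params) for books matching at least `min_hits` search terms.

    Hypothetical helper: one DISTINCT subquery per term, combined with
    UNION ALL, then counted per book with GROUP BY / HAVING.
    """
    parts, params = [], []
    for name in names:
        parts.append("SELECT DISTINCT bookid FROM characternames WHERE name = ?")
        params.append(name)
    for locname in locations:
        parts.append("SELECT DISTINCT bookid FROM locations WHERE locname = ?")
        params.append(locname)
    if not parts:
        raise ValueError("at least one search term is required")
    sql = (
        "SELECT bookid FROM (\n  "
        + "\n  UNION ALL\n  ".join(parts)
        + "\n) x\nGROUP BY bookid\nHAVING count(*) >= ?\nORDER BY bookid"
    )
    return sql, params + [min_hits]


# Made-up sample data in an in-memory database (sqlite3 is an assumption here).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE characternames (bookid INTEGER, name TEXT, position INTEGER);
    CREATE TABLE locations      (bookid INTEGER, locname TEXT, position INTEGER);
    INSERT INTO characternames VALUES
        (1, 'Tim', 10), (1, 'Tim', 55), (1, 'Al', 20),
        (2, 'Tim', 5), (3, 'Al', 7), (4, 'Tim', 3),
        (6, 'Tim', 2), (6, 'Al', 8);
    INSERT INTO locations VALUES
        (1, 'Toolshop', 30), (2, 'Toolshop', 40),
        (3, 'Toolshop', 9), (3, 'Toolshop', 12), (5, 'Toolshop', 1);
""")

sql, params = build_book_filter(["Tim", "Al"], ["Toolshop"])
matching_books = [row[0] for row in conn.execute(sql, params)]
print(matching_books)  # [1, 2, 3, 6]
```

Books 4 and 5 match only one term each and are filtered out. With psycopg2 the placeholder would be %s instead of ?, and the generated statement can be wrapped in the WITH b AS (...) CTE exactly like the hand-written version.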



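  • Combining conditions on one table

    You cannot simply combine multiple conditions on one table in a single subquery: how would you count the number of findings? There is a somewhat more sophisticated way, though. It may or may not improve performance; you will have to test (with EXPLAIN ANALYZE). Both queries need at least two index scans on table characterNames. At least it shortens the syntax.

    Consider how I count the hits on characterNames and how I changed to sum(hits) in the outer SELECT: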
    WITH b AS (
       SELECT bookid
       FROM  (
          SELECT bookid
               , max((name='XXX')::int)
               + max((name='YYY')::int) AS hits
          FROM   characterNames
          WHERE  (name='XXX'
               OR name='YYY')
          GROUP  BY bookid
    
          UNION ALL
          SELECT DISTINCT bookid, 1 AS hits
          FROM   locations
          WHERE  locname='ZZZ'
          ) x
       GROUP  BY bookid
       HAVING sum(hits) > 1
       )
    ...
    

    Casting a boolean to integer yields 0 for FALSE and 1 for TRUE. That helps.
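To see the boolean-to-integer trick in isolation, here is a small sketch that computes the per-book hits column for two names in one pass over characternames. It assumes sqlite3 as a server-free stand-in, so the PostgreSQL cast (name='XXX')::int is written as CAST(... AS INTEGER); the sample rows and the parameter names :first and :second are made up for illustration.

```python
import sqlite3

# In-memory database with made-up rows (sqlite3 stand-in, an assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE characternames (bookid INTEGER, name TEXT, position INTEGER);
    INSERT INTO characternames VALUES
        (1, 'Tim', 10), (1, 'Tim', 55), (1, 'Al', 20),
        (2, 'Tim', 5), (3, 'Al', 7), (4, 'Tim', 3),
        (6, 'Tim', 2), (6, 'Al', 8);
""")

# max() over the 0/1 values is 1 if the name occurs at least once in the book,
# so the sum counts how many of the two names were found, not how many rows.
HITS_SQL = """
    SELECT bookid,
           max(CAST(name = :first  AS INTEGER))
         + max(CAST(name = :second AS INTEGER)) AS hits
    FROM   characternames
    WHERE  name IN (:first, :second)
    GROUP  BY bookid
    ORDER  BY bookid
"""

hits_per_book = conn.execute(HITS_SQL, {"first": "Tim", "second": "Al"}).fetchall()
print(hits_per_book)  # [(1, 2), (2, 1), (3, 1), (4, 1), (6, 2)]
```

Book 1 mentions Tim twice but still contributes only one hit for Tim, which is why max() is used rather than sum() inside the subquery.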

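    Faster with EXISTS

    This kept kicking around in the back of my head while I was riding my bike to my company. I have reason to believe this query might be even faster. Please give it a try: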
    WITH b AS (
       SELECT bookid, c_hits, l_hits
       FROM  (
          SELECT bookid

               , (EXISTS (
                   SELECT *
                   FROM   characterNames c
                   WHERE  c.bookid = b.bookid
                   AND    c.name = 'XXX'))::int
               + (EXISTS (
                   SELECT *
                   FROM   characterNames c
                   WHERE  c.bookid = b.bookid
                   AND    c.name = 'YYY'))::int AS c_hits

               , (EXISTS (
                   SELECT *
                   FROM   locations l
                   WHERE  l.bookid = b.bookid
                   AND    l.locname='ZZZ'))::int AS l_hits
          FROM   books b
          ) sub
       -- output aliases are not visible in the WHERE clause of the same SELECT,
       -- so the filter is applied one level up
       WHERE  c_hits + l_hits > 1
       )
    SELECT c.bookid, c.position, 'char' AS what
    FROM   b
    JOIN   characternames c USING (bookid)
    WHERE  b.c_hits > 0
    AND    c.name IN ('XXX', 'YYY')
    
    UNION  ALL
    SELECT l.bookid, l.position, 'loc' AS what
    FROM   b
    JOIN   locations l USING (bookid)
    WHERE  b.l_hits > 0
    AND    l.locname = 'ZZZ'
    ORDER  BY 1,2,3;
    
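  • EXISTS semi-joins can stop executing at the first match. Since we are only interested in an all-or-nothing answer in the CTE, this may do the job considerably faster.
  • This way we also do not need to aggregate (no GROUP BY necessary).
  • I also remember whether any characters or locations were found, and only revisit the tables with actual matches.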
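The EXISTS idea can be tried out on toy data with the sketch below. It assumes sqlite3 as a stand-in for PostgreSQL, where EXISTS already evaluates to 0 or 1 and no ::int cast is needed; the rows and the parameter names :name1, :name2 and :loc are hypothetical. The filter on the hit counts sits in an outer query because an output alias cannot be referenced in the WHERE clause of the SELECT that defines it.

```python
import sqlite3

# In-memory database with made-up rows (sqlite3 stand-in, an assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE books          (bookid INTEGER PRIMARY KEY);
    CREATE TABLE characternames (bookid INTEGER, name TEXT, position INTEGER);
    CREATE TABLE locations      (bookid INTEGER, locname TEXT, position INTEGER);
    INSERT INTO books VALUES (1), (2), (3), (4), (5), (6);
    INSERT INTO characternames VALUES
        (1, 'Tim', 10), (1, 'Tim', 55), (1, 'Al', 20),
        (2, 'Tim', 5), (3, 'Al', 7), (4, 'Tim', 3),
        (6, 'Tim', 2), (6, 'Al', 8);
    INSERT INTO locations VALUES
        (1, 'Toolshop', 30), (2, 'Toolshop', 40),
        (3, 'Toolshop', 9), (3, 'Toolshop', 12), (5, 'Toolshop', 1);
""")

# One EXISTS probe per search term; no GROUP BY is needed.
EXISTS_SQL = """
    SELECT bookid, c_hits, l_hits
    FROM  (
        SELECT b.bookid,
               EXISTS (SELECT 1 FROM characternames c
                       WHERE c.bookid = b.bookid AND c.name = :name1)
             + EXISTS (SELECT 1 FROM characternames c
                       WHERE c.bookid = b.bookid AND c.name = :name2) AS c_hits,
               EXISTS (SELECT 1 FROM locations l
                       WHERE l.bookid = b.bookid AND l.locname = :loc) AS l_hits
        FROM   books b
    ) sub
    WHERE  c_hits + l_hits > 1
    ORDER  BY bookid
"""

rows = conn.execute(
    EXISTS_SQL, {"name1": "Tim", "name2": "Al", "loc": "Toolshop"}
).fetchall()
print(rows)  # [(1, 2, 1), (2, 1, 1), (3, 1, 1), (6, 2, 0)]
```

Book 6 has l_hits = 0, so a follow-up query in the style above would skip the locations table for it. Whether this variant actually beats the counting version on real data is something only EXPLAIN ANALYZE on the PostgreSQL side can tell.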
Regarding "python - Efficient query for merging more than 2 subqueries", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/10273604/
