I have a schema (millions of records, with proper indexes) that looks like this:

groups    |  interests
------    |  ---------
user_id   |  user_id
group_id  |  interest_id


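A user can like 0..many interests and belong to 0..many groups.

Problem: given a group ID, I want to get all the interests of every user who does not belong to that group and who shares at least one interest with anyone who does belong to that group.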

Since the above may be confusing, here is a simple example (SQLFiddle):

| 1 | 2 | 3 | 4 | 5 | (User IDs)
|-------------------|
| A |   | A |   |   |
| B | B | B |   | B |
|   | C |   |   |   |
|   |   | D | D |   |


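In the example above, users are labeled with numbers and interests with letters.

If we assume that users 1 and 2 belong to group -1, then users 3 and 5 would be interesting: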

user_id  interest_id
-------  -----------
      3            A
      3            B
      3            D
      5            B


I have already written a dumb and very inefficient query that correctly returns the above:

SELECT * FROM "interests" WHERE "user_id" IN (
    SELECT "user_id" FROM "interests" WHERE "interest_id" IN (
        SELECT "interest_id" FROM "interests" WHERE "user_id" IN (
            SELECT "user_id" FROM "groups" WHERE "group_id" = -1
        )
    ) AND "user_id" NOT IN (
        SELECT "user_id" FROM "groups" WHERE "group_id" = -1
    )
);
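
For anyone who wants to reproduce this, here is a minimal sketch that loads the example data into an in-memory SQLite database and runs the nested `IN` query. SQLite, the helper names `build_db` and `interests_of_outsiders`, and the added `ORDER BY` are assumptions made for illustration; the real tables live in a different database with millions of rows.

```python
import sqlite3

# Sample data from the example above: (user_id, interest_id) pairs.
INTERESTS = [(1, "A"), (3, "A"), (1, "B"), (2, "B"), (3, "B"),
             (5, "B"), (2, "C"), (3, "D"), (4, "D")]
# (user_id, group_id): users 1 and 2 belong to group -1.
GROUPS = [(1, -1), (2, -1)]


def build_db():
    """Create an in-memory database holding the example rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute('CREATE TABLE "interests" ("user_id" INTEGER, "interest_id" TEXT)')
    conn.execute('CREATE TABLE "groups" ("user_id" INTEGER, "group_id" INTEGER)')
    conn.executemany('INSERT INTO "interests" VALUES (?, ?)', INTERESTS)
    conn.executemany('INSERT INTO "groups" VALUES (?, ?)', GROUPS)
    return conn


# Same shape as the query above, with the group ID passed as a parameter
# and an ORDER BY added so the output is deterministic.
NESTED_IN_QUERY = """
SELECT "user_id", "interest_id" FROM "interests" WHERE "user_id" IN (
    SELECT "user_id" FROM "interests" WHERE "interest_id" IN (
        SELECT "interest_id" FROM "interests" WHERE "user_id" IN (
            SELECT "user_id" FROM "groups" WHERE "group_id" = ?
        )
    ) AND "user_id" NOT IN (
        SELECT "user_id" FROM "groups" WHERE "group_id" = ?
    )
)
ORDER BY "user_id", "interest_id"
"""


def interests_of_outsiders(conn, group_id):
    """Interests of non-members who share an interest with a group member."""
    return conn.execute(NESTED_IN_QUERY, (group_id, group_id)).fetchall()


rows = interests_of_outsiders(build_db(), -1)
for user_id, interest_id in rows:
    print(user_id, interest_id)
```

On the sample data this yields the four rows listed in the expected output for users 3 and 5.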


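But all my attempts to turn this into a proper join query have been fruitless: either the query returns far more rows than it should, or it takes 10 times as long as the subquery version, for instance: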

SELECT "iii"."user_id" FROM "interests" AS "iii"
WHERE EXISTS
(
    SELECT "ii"."user_id", "ii"."interest_id" FROM "groups" AS "gg"
    INNER JOIN "interests" AS "ii" ON "gg"."user_id" = "ii"."user_id"
    WHERE EXISTS
    (
        SELECT "i"."interest_id" FROM "groups" AS "g"
        INNER JOIN "interests" AS "i" ON "g"."user_id" = "i"."user_id"
        WHERE "group_id" = -1 AND "i"."interest_id" = "ii"."interest_id"
    ) AND "group_id" != -1 AND "ii"."user_id" = "iii"."user_id"
);


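I have been struggling to optimize this query for the past two nights...

Any help or insight that points me in the right direction would be greatly appreciated. :)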



PS: Ideally, a query that also returned the number of common interests would be even better:

user_id  totalInterests  commonInterests
-------  --------------  ---------------
      3               3              1/2 (either is fine, but 2 is better)
      5               1                1


However, I am not sure how much slower that would be compared to doing the comparison in code.

Best answer
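
For reference, the in-code comparison could look roughly like the sketch below. The function name `summarize` and the hard-coded inputs are illustrative assumptions: `candidate_rows` stands for the rows returned by the query above, and `group_interests` for the set of interests held by members of the given group.

```python
from collections import defaultdict


def summarize(candidate_rows, group_interests):
    """Return {user_id: (total_interests, common_interests)}."""
    per_user = defaultdict(set)
    for user_id, interest_id in candidate_rows:
        per_user[user_id].add(interest_id)
    return {
        user_id: (len(own), len(own & group_interests))
        for user_id, own in per_user.items()
    }


# Rows for users 3 and 5 from the example; group -1 covers interests A, B, C.
candidate_rows = [(3, "A"), (3, "B"), (3, "D"), (5, "B")]
group_interests = {"A", "B", "C"}

summary = summarize(candidate_rows, group_interests)
for user_id, (total, common) in sorted(summary.items()):
    print(user_id, total, common)
```

This counts every common interest (2 for user 3), at the cost of pulling all candidate rows into the application first.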

I set up test tables with the following:

--drop table Interests  ----------------------------
CREATE TABLE Interests
 (
   InterestId  char(1)  not null
  ,UserId      int      not null
 )

INSERT Interests values
  ('A',1)
 ,('A',3)
 ,('B',1)
 ,('B',2)
 ,('B',3)
 ,('B',5)
 ,('C',2)
 ,('D',3)
 ,('D',4)


--  drop table Groups  ---------------------
CREATE TABLE Groups
 (
   GroupId  int  not null
  ,UserId   int  not null
 )

INSERT Groups values
  (-1, 1)
 ,(-1, 2)


SELECT * from Groups
SELECT * from Interests


The following query appears to do what you want:

DECLARE @GroupId int

SET @GroupId = -1

;WITH cteGroupInterests (InterestId)
 as (--  List of the interests referenced by the target group
     select distinct InterestId
      from Groups gr
       inner join Interests nt
        on nt.UserId = gr.UserId
      where gr.GroupId = @GroupId)
--  Aggregate interests for each user
SELECT
   UserId
  ,count(OwnInterstId)      OwnInterests
  ,count(SharedInterestId)  SharedInterests
 from (--  Subquery lists all interests for each user
       select
          nt.UserId
         ,nt.InterestId   OwnInterstId
         ,cte.InterestId  SharedInterestId
        from Interests nt
         left outer join cteGroupInterests cte
          on cte.InterestId = nt.InterestId
        where not exists (--  Correlated subquery: is "this" user in the target group?)
                          select 1
                           from Groups gr
                           where gr.GroupId = @GroupId
                            and gr.UserId = nt.UserId)) xx
 group by UserId
 having count(SharedInterestId) > 0


It seems to work, but I would want to run more elaborate tests, and I do not know how well it will perform against millions of rows. Key points:


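- The CTE acts as a named intermediate result that the later query references (SQL Server typically expands it inline rather than materializing it); building an actual temp table might improve performance
- Correlated subqueries can be tricky, but with proper indexes the not exists check should be fast
- I was lazy and left out all the underscores, sorry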
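
The temp-table idea from the first point could be sketched as follows, again using SQLite from Python so it can be run as-is. Everything here is an assumption for illustration: the index names and column choices, the temp table name `GroupInterests`, and the helper `shared_interest_counts` are hypothetical, and whether this beats the CTE on real data would have to be measured.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Interests (InterestId TEXT NOT NULL, UserId INTEGER NOT NULL);
CREATE TABLE "Groups" (GroupId INTEGER NOT NULL, UserId INTEGER NOT NULL);
INSERT INTO Interests VALUES
  ('A',1),('A',3),('B',1),('B',2),('B',3),('B',5),('C',2),('D',3),('D',4);
INSERT INTO "Groups" VALUES (-1,1),(-1,2);
-- Hypothetical indexes supporting the group lookup and the NOT EXISTS probe
CREATE INDEX ix_groups_group_user ON "Groups" (GroupId, UserId);
CREATE INDEX ix_interests_user_interest ON Interests (UserId, InterestId);
""")


def shared_interest_counts(conn, group_id):
    """Materialize the group's interests first, then aggregate per user."""
    conn.execute("DROP TABLE IF EXISTS temp.GroupInterests")
    conn.execute("CREATE TEMP TABLE GroupInterests (InterestId TEXT PRIMARY KEY)")
    conn.execute(
        """
        INSERT INTO GroupInterests (InterestId)
        SELECT DISTINCT nt.InterestId
        FROM "Groups" gr
        INNER JOIN Interests nt ON nt.UserId = gr.UserId
        WHERE gr.GroupId = ?
        """,
        (group_id,),
    )
    return conn.execute(
        """
        SELECT nt.UserId,
               count(nt.InterestId) AS OwnInterests,
               count(gi.InterestId) AS SharedInterests
        FROM Interests nt
        LEFT OUTER JOIN GroupInterests gi ON gi.InterestId = nt.InterestId
        WHERE NOT EXISTS (
            SELECT 1 FROM "Groups" gr
            WHERE gr.GroupId = ? AND gr.UserId = nt.UserId
        )
        GROUP BY nt.UserId
        HAVING count(gi.InterestId) > 0
        ORDER BY nt.UserId
        """,
        (group_id,),
    ).fetchall()


counts = shared_interest_counts(conn, -1)
for user_id, own, shared in counts:
    print(user_id, own, shared)
```

The primary key on the temp table gives the join a unique lookup per interest, which is the main thing a materialized copy offers over the inline CTE.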

08-18 01:09