I have a schema (millions of records, properly indexed) that looks like this:
groups   | interests
-------- | -----------
user_id  | user_id
group_id | interest_id
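A user can like 0..many interests and belong to 0..many groups.
Problem: given a group ID, I want to get all the interests of every user who does not belong to that group but shares at least one interest with anyone who does belong to it.
Since the above may be confusing, here is a simple example (SQLFiddle):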
| 1 | 2 | 3 | 4 | 5 | (User IDs)
|-------------------|
| A |   | A |   |   |
| B | B | B |   | B |
|   | C |   |   |   |
|   |   | D | D |   |
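In the example above, users are labeled with numbers and interests with letters.
If we assume that users 1 and 2 belong to group -1, then users 3 and 5 are the ones of interest: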
user_id  interest_id
-------  -----------
3        A
3        B
3        D
5        B
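The requirement can also be written down as a small set-based reference, which is handy for checking any SQL rewrite against the sample data. This is only a sketch: the dictionary layout and the function name `interesting_rows` are assumptions made for illustration, not part of the schema above.

```python
# Sample data from the example: user id -> set of interests.
interests = {
    1: {"A", "B"},
    2: {"B", "C"},
    3: {"A", "B", "D"},
    4: {"D"},
    5: {"B"},
}
# Assumed membership from the example: users 1 and 2 belong to group -1.
group_members = {1, 2}


def interesting_rows(interests, group_members):
    """Return (user_id, interest_id) pairs for non-members that share
    at least one interest with any member of the group."""
    # Union of every interest held by at least one group member.
    group_interests = set()
    for user in group_members:
        group_interests |= interests.get(user, set())

    rows = []
    for user, own in sorted(interests.items()):
        if user in group_members:
            continue  # group members themselves are excluded
        if own & group_interests:  # at least one shared interest
            rows.extend((user, interest) for interest in sorted(own))
    return rows


rows = interesting_rows(interests, group_members)
print(rows)
```

User 4 is dropped because D is not an interest of any group member, which matches the expected table.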
I have already written a dumb and inefficient query that correctly returns the above:
SELECT * FROM "interests" WHERE "user_id" IN (
    SELECT "user_id" FROM "interests" WHERE "interest_id" IN (
        SELECT "interest_id" FROM "interests" WHERE "user_id" IN (
            SELECT "user_id" FROM "groups" WHERE "group_id" = -1
        )
    ) AND "user_id" NOT IN (
        SELECT "user_id" FROM "groups" WHERE "group_id" = -1
    )
);
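But all my attempts to convert it into a proper join query have been fruitless: either the query returns more rows than it should, or it takes 10 times as long as the subquery version, for example: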
SELECT "iii"."user_id" FROM "interests" AS "iii"
WHERE EXISTS
(
    SELECT "ii"."user_id", "ii"."interest_id" FROM "groups" AS "gg"
    INNER JOIN "interests" AS "ii" ON "gg"."user_id" = "ii"."user_id"
    WHERE EXISTS
    (
        SELECT "i"."interest_id" FROM "groups" AS "g"
        INNER JOIN "interests" AS "i" ON "g"."user_id" = "i"."user_id"
        WHERE "group_id" = -1 AND "i"."interest_id" = "ii"."interest_id"
    ) AND "group_id" != -1 AND "ii"."user_id" = "iii"."user_id"
);
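I have been struggling to optimize this query for the past two nights...
Any help or insight that points me in the right direction would be greatly appreciated. :)
PS: Ideally, a query that also returns the number of common interests would be even better: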
user_id  totalInterests  commonInterests
-------  --------------  ---------------
3        3               1/2 (either is fine, but 2 is better)
5        1               1
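However, I am not sure how much slower that would be compared with doing the comparison in code.
Best answer
Set up the test tables with the following: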
--drop table Interests ----------------------------
CREATE TABLE Interests
(
InterestId char(1) not null
,UserId int not null
)
INSERT Interests values
('A',1)
,('A',3)
,('B',1)
,('B',2)
,('B',3)
,('B',5)
,('C',2)
,('D',3)
,('D',4)
-- drop table Groups ---------------------
CREATE TABLE Groups
(
GroupId int not null
,UserId int not null
)
INSERT Groups values
(-1, 1)
,(-1, 2)
SELECT * from Groups
SELECT * from Interests
The following query appears to do what you want:
DECLARE @GroupId int
SET @GroupId = -1

;WITH cteGroupInterests (InterestId)
 as (-- List of the interests referenced by the target group
     select distinct nt.InterestId
      from Groups gr
       inner join Interests nt
        on nt.UserId = gr.UserId
      where gr.GroupId = @GroupId)
-- Aggregate interests for each user
SELECT
   UserId
  ,count(OwnInterestId)     OwnInterests
  ,count(SharedInterestId)  SharedInterests
 from (-- Subquery lists all interests for each user outside the group
       select
          nt.UserId
         ,nt.InterestId   OwnInterestId
         ,cte.InterestId  SharedInterestId  -- NULL when the group does not share it
        from Interests nt
         left outer join cteGroupInterests cte
          on cte.InterestId = nt.InterestId
        where not exists (-- Correlated subquery: is "this" user in the target group?
                          select 1
                           from Groups gr
                           where gr.GroupId = @GroupId
                            and gr.UserId = nt.UserId)) xx
 group by UserId
 having count(SharedInterestId) > 0
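To check the query against the sample data without a SQL Server instance, it can be run through Python's built-in sqlite3 module. This is a sketch under a few assumptions: SQLite stands in for the original engine, the T-SQL variable is replaced by a bound parameter named `:group_id`, the table name is quoted because GROUPS is a keyword in recent SQLite versions, and an ORDER BY is added only to make the output deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Same test data as the setup script, adapted to SQLite types.
conn.executescript("""
CREATE TABLE Interests (InterestId TEXT NOT NULL, UserId INTEGER NOT NULL);
CREATE TABLE "Groups" (GroupId INTEGER NOT NULL, UserId INTEGER NOT NULL);
INSERT INTO Interests VALUES
    ('A', 1), ('A', 3), ('B', 1), ('B', 2), ('B', 3),
    ('B', 5), ('C', 2), ('D', 3), ('D', 4);
INSERT INTO "Groups" VALUES (-1, 1), (-1, 2);
""")

QUERY = """
WITH cteGroupInterests (InterestId) AS (
    -- Interests held by at least one member of the target group
    SELECT DISTINCT nt.InterestId
    FROM "Groups" gr
    INNER JOIN Interests nt ON nt.UserId = gr.UserId
    WHERE gr.GroupId = :group_id
)
SELECT xx.UserId,
       count(xx.OwnInterestId)    AS OwnInterests,
       count(xx.SharedInterestId) AS SharedInterests
FROM (
    -- One row per interest of each user outside the group
    SELECT nt.UserId,
           nt.InterestId  AS OwnInterestId,
           cte.InterestId AS SharedInterestId
    FROM Interests nt
    LEFT OUTER JOIN cteGroupInterests cte ON cte.InterestId = nt.InterestId
    WHERE NOT EXISTS (
        SELECT 1
        FROM "Groups" gr
        WHERE gr.GroupId = :group_id AND gr.UserId = nt.UserId
    )
) xx
GROUP BY xx.UserId
HAVING count(xx.SharedInterestId) > 0
ORDER BY xx.UserId
"""

result = conn.execute(QUERY, {"group_id": -1}).fetchall()
print(result)
```

On the sample data this yields one row for user 3 (3 own interests, 2 shared) and one for user 5 (1 own, 1 shared), which matches the table requested in the PS.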
It seems to work, but I would want to do more thorough testing, and I don't know how well it will perform on millions of rows. The key points are:
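- The CTE acts as a named intermediate result that the later query references; building an actual temp table might improve performance.
- Correlated subqueries can be tricky, but with indexes and not exists this one should be fast.
- I was lazy and did not add all the underscores, sorry.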
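The temp-table idea from the first point could look roughly like the following. It is a sketch only: it uses SQLite through Python's sqlite3 module rather than the original engine, the names `shared_interest_counts` and `GroupInterests` are made up for illustration, and the derived table from the query above is flattened into a single SELECT. Whether this is actually faster depends on the engine, the indexes, and the data, so it should be measured rather than assumed.

```python
import sqlite3


def shared_interest_counts(conn, group_id):
    """Return (UserId, OwnInterests, SharedInterests) rows, using a temp
    table in place of the CTE. Hypothetical helper for illustration."""
    # Materialize the group's interests once; the primary key gives the
    # join below an index to probe.
    conn.execute("DROP TABLE IF EXISTS temp.GroupInterests")
    conn.execute("CREATE TEMP TABLE GroupInterests (InterestId TEXT PRIMARY KEY)")
    conn.execute(
        """
        INSERT INTO temp.GroupInterests (InterestId)
        SELECT DISTINCT nt.InterestId
        FROM "Groups" gr
        INNER JOIN Interests nt ON nt.UserId = gr.UserId
        WHERE gr.GroupId = ?
        """,
        (group_id,),
    )
    return conn.execute(
        """
        SELECT nt.UserId,
               count(nt.InterestId) AS OwnInterests,
               count(gi.InterestId) AS SharedInterests
        FROM Interests nt
        LEFT OUTER JOIN temp.GroupInterests gi ON gi.InterestId = nt.InterestId
        WHERE NOT EXISTS (
            SELECT 1
            FROM "Groups" gr
            WHERE gr.GroupId = ? AND gr.UserId = nt.UserId
        )
        GROUP BY nt.UserId
        HAVING count(gi.InterestId) > 0
        ORDER BY nt.UserId
        """,
        (group_id,),
    ).fetchall()


conn = sqlite3.connect(":memory:")
# "Groups" is quoted because GROUPS is a keyword in recent SQLite versions.
conn.executescript("""
CREATE TABLE Interests (InterestId TEXT NOT NULL, UserId INTEGER NOT NULL);
CREATE TABLE "Groups" (GroupId INTEGER NOT NULL, UserId INTEGER NOT NULL);
INSERT INTO Interests VALUES
    ('A', 1), ('A', 3), ('B', 1), ('B', 2), ('B', 3),
    ('B', 5), ('C', 2), ('D', 3), ('D', 4);
INSERT INTO "Groups" VALUES (-1, 1), (-1, 2);
""")

counts = shared_interest_counts(conn, -1)
print(counts)
```

Dropping and recreating the temp table on each call keeps the helper safe to call repeatedly on one connection; a long-lived table refreshed only when group membership changes would be another option.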