I have values in a table like this.
id | val1 | val2
--------------------
1 | e1 | m1
2 | e1 | m2
3 | e2 | m2
4 | e3 | m1
5 | e4 | m3
6 | e5 | m3
7 | e5 | m4
8 | e4 | m5
From this, I need to recover the unique users and give each one a unique id, like this:
User1 -> (val1 : e1, e2, e3 | val2: m1, m2)
e1 <-> m1, e1 <-> m2, m1 <-> e3, e2 <-> m2 ( <-> means linked).
e1 is connected to m1.
e1 is connected to m2.
m2 is connected to e2.
So e1,m1 are connected to e2.
Similarly, we find e1, e2, e3, m1, m2 all are linked. We need to identify these chains.
User2 -> (val1 : e4, e5 | val2: m3, m4, m5)
I have written two queries, one grouping by val1 and one grouping by val2, and I join their results in code (Java).
I want to do this directly in the MySQL/BigQuery query itself, as we are building some reports on this.
Is this possible in a single query? Please help.
Thank you.
Update :
Desired output -
[
{
id : user1,
val1 : [e1, e2, e3],
val2 : [m1, m2]
},
{
id : user2,
val1 : [e4, e5],
val2 : [m3, m4, m5]
}
]
or
id | val1 | val2 | UUID
------------------------
1 | e1 | m1 | u1
2 | e1 | m2 | u1
3 | e2 | m2 | u1
4 | e3 | m1 | u1
5 | e4 | m3 | u2
6 | e5 | m3 | u2
7 | e5 | m4 | u2
8 | e4 | m5 | u2
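For simplicity, assume the values of val1 and val2 are nodes, and two nodes are connected if they appear in the same row.
The rows of the table form graphs (user1, user2), and we need to identify these graphs.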
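Put this way, the task is a connected-components problem. As a reference for what any query should return, here is a minimal Python sketch that labels each row with a user id using union-find. Union-find is an illustrative choice and does not come from the question. The names `ROWS`, `find` and `label_users` are hypothetical, and the sketch assumes all rows fit in memory.

```python
# Rows from the question: (id, val1, val2)
ROWS = [
    (1, "e1", "m1"), (2, "e1", "m2"), (3, "e2", "m2"), (4, "e3", "m1"),
    (5, "e4", "m3"), (6, "e5", "m3"), (7, "e5", "m4"), (8, "e4", "m5"),
]

def find(parent, x):
    """Return the representative of x's set, shortening the path on the way."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def label_users(rows):
    """Map each row id to a user label (u1, u2, ...), one per connected component."""
    parent = {}
    for _, v1, v2 in rows:
        # A row links its val1 node to its val2 node; the column tag keeps
        # the two value namespaces apart.
        root1 = find(parent, ("val1", v1))
        root2 = find(parent, ("val2", v2))
        parent[root1] = root2
    labels = {}  # component root -> label, in order of first appearance
    result = {}
    for row_id, v1, _ in rows:
        root = find(parent, ("val1", v1))
        labels.setdefault(root, f"u{len(labels) + 1}")
        result[row_id] = labels[root]
    return result

users = label_users(ROWS)
print(users)
```

Rows 1 to 4 receive one label and rows 5 to 8 another, which matches the UUID column in the desired output.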
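Best answer
I want to add an option for solving this task with pure BigQuery (Standard SQL).
Prerequisites / assumptions: the source data is in sandbox.temp.id1_id2_pairs.
You should replace this with your own table or, if you want to test with the dummy data from the question, you can create this table with the prerequisite query at the bottom of this answer (of course, replace sandbox.temp with your own project.dataset).
Make sure you set the respective destination table.
Note: you can find all the respective queries (as text) at the bottom of this answer, but for now I am illustrating my answer with screenshots, so everything is shown: the query, the result, and the options used.
So, there will be three steps:
Step 1 - Initialization
Here, we just do an initial grouping of id1 based on its connections with id2:
As you can see here, we created, for each id1, a list of all id1 values connected to it through a simple one-level connection via id2.
The output table is sandbox.temp.groups.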
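To make the initialization concrete, the following Python sketch mirrors the same one-level grouping in memory. It is only an illustration of the logic, not the BigQuery query itself. `PAIRS` copies the prerequisite data from this answer, and `initial_groups` is a hypothetical helper name.

```python
from collections import defaultdict

# (id1, id2) pairs from the prerequisite data in this answer
PAIRS = [
    ("e1", "m1"), ("e1", "m2"), ("e2", "m2"), ("e3", "m1"),
    ("e4", "m3"), ("e5", "m3"), ("e5", "m4"), ("e4", "m5"),
    ("e6", "m6"), ("e7", "m7"), ("e2", "m6"), ("e4", "m55"),
]

def initial_groups(pairs):
    """For each id1, collect every id1 that shares at least one id2 with it."""
    id1s_by_id2 = defaultdict(set)
    for id1, id2 in pairs:
        id1s_by_id2[id2].add(id1)
    groups = defaultdict(set)
    for id1, id2 in pairs:
        groups[id1] |= id1s_by_id2[id2]
    # Sorted, comma-separated strings, like the STRING_AGG output in the query.
    return {id1: ",".join(sorted(members)) for id1, members in groups.items()}

groups = initial_groups(PAIRS)
for id1 in sorted(groups):
    print(id1, "->", groups[id1])
```

For this data, e1 maps to e1,e2,e3 while e2 maps to e1,e2,e6, so one level of connection is not yet enough to merge the whole chain.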
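Step 2 - Grouping iterations
In each iteration we enrich the grouping based on the groups already established.
The source of the query is the output table of the previous step (sandbox.temp.groups), and the destination is the same table (sandbox.temp.groups), with overwrite.
We continue iterating until the number of groups found is the same as in the previous iteration.
Note: you can keep two BigQuery Web UI tabs open (as shown above) and, without changing any code, just run Grouping and then Check, again and again, until the iterations converge.
(For the specific data I used in the prerequisites section, I needed three iterations: the first produced 5 users, the second produced 3 users, and the third again produced 3 users, which indicates that we are done iterating.)
Of course, in real life the number of iterations could be more than three, so we need some kind of automation (see the respective section at the bottom of the answer).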
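The iteration and its stopping rule can be reproduced in memory with a short Python sketch. This illustrates the logic under the assumption that Step 1 produced the groups listed in `STEP1` for the prerequisite data. The function names are hypothetical.

```python
# Assumed output of Step 1 for the prerequisite data: id1 -> linked id1 values.
STEP1 = {
    "e1": {"e1", "e2", "e3"}, "e2": {"e1", "e2", "e6"}, "e3": {"e1", "e3"},
    "e4": {"e4", "e5"}, "e5": {"e4", "e5"}, "e6": {"e2", "e6"}, "e7": {"e7"},
}

def iterate_once(groups):
    """For each id, merge every existing group that already contains that id."""
    return {
        key: frozenset().union(*(g for g in groups.values() if key in g))
        for key in groups
    }

def count_groups(groups):
    """Number of distinct groups, like COUNT(DISTINCT grp) in the check query."""
    return len({frozenset(g) for g in groups.values()})

def converge(groups, max_iterations=50):
    """Iterate until two consecutive iterations report the same group count."""
    counts = []
    for _ in range(max_iterations):
        groups = iterate_once(groups)
        counts.append(count_groups(groups))
        if len(counts) > 1 and counts[-1] == counts[-2]:
            break
    return groups, counts

final_groups, counts = converge(STEP1)
print(counts)  # [5, 3, 3]
```

The counts match the three iterations described above. Comparing the groups themselves, rather than only their count, would be a stricter convergence test.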
Step 3 - Final grouping
When the id1 grouping is complete, we can add the final grouping for id2.
The final result is now in the sandbox.temp.users table.
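As a rough illustration of this last step, the Python sketch below attaches the id2 values to each converged id1 group. `GROUPS` is assumed to be the converged result of Step 2 for the prerequisite data, and `final_users` is a hypothetical helper. Numbering users in sorted group order is a choice made for the sketch only.

```python
from collections import defaultdict

# (id1, id2) pairs from the prerequisite data in this answer
PAIRS = [
    ("e1", "m1"), ("e1", "m2"), ("e2", "m2"), ("e3", "m1"),
    ("e4", "m3"), ("e5", "m3"), ("e5", "m4"), ("e4", "m5"),
    ("e6", "m6"), ("e7", "m7"), ("e2", "m6"), ("e4", "m55"),
]
# Assumed converged id1 groups for this data (result of the Step 2 iterations).
GROUPS = ["e1,e2,e3,e6", "e4,e5", "e7"]

def final_users(pairs, groups):
    """Build one record per group with its id1 list and the linked id2 list."""
    id2s_by_id1 = defaultdict(set)
    for id1, id2 in pairs:
        id2s_by_id1[id1].add(id2)
    users = []
    for number, grp in enumerate(sorted(groups), start=1):
        id2s = set()
        for id1 in grp.split(","):
            id2s |= id2s_by_id1[id1]
        users.append({"id": number, "id1": grp, "id2": ",".join(sorted(id2s))})
    return users

users = final_users(PAIRS, GROUPS)
for user in users:
    print(user)
```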
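Queries used (do not forget to set the respective destination tables and to overwrite when needed, as per the logic and screenshots above):
Prerequisites: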
#standardSQL
SELECT 1 id, 'e1' id1, 'm1' id2 UNION ALL
SELECT 2, 'e1', 'm2' UNION ALL
SELECT 3, 'e2', 'm2' UNION ALL
SELECT 4, 'e3', 'm1' UNION ALL
SELECT 5, 'e4', 'm3' UNION ALL
SELECT 6, 'e5', 'm3' UNION ALL
SELECT 7, 'e5', 'm4' UNION ALL
SELECT 8, 'e4', 'm5' UNION ALL
SELECT 9, 'e6', 'm6' UNION ALL
SELECT 9, 'e7', 'm7' UNION ALL
SELECT 9, 'e2', 'm6' UNION ALL
SELECT 888, 'e4', 'm55'
Step 1
#standardSQL
WITH `yourTable` AS (select * from `sandbox.temp.id1_id2_pairs`
), x1 AS (SELECT id1, STRING_AGG(id2) id2s FROM `yourTable` GROUP BY id1
), x2 AS (SELECT id2, STRING_AGG(id1) id1s FROM `yourTable` GROUP BY id2
), x3 AS (
SELECT id, (SELECT STRING_AGG(i ORDER BY i) FROM (
SELECT DISTINCT i FROM UNNEST(SPLIT(id1s)) i)) grp
FROM (
SELECT x1.id1 id, STRING_AGG((id1s)) id1s FROM x1 CROSS JOIN x2
WHERE EXISTS (SELECT y FROM UNNEST(SPLIT(id1s)) y WHERE x1.id1 = y)
GROUP BY id1)
)
SELECT * FROM x3
Step 2 - Grouping
#standardSQL
WITH x3 AS (select * from `sandbox.temp.groups`)
SELECT id, (SELECT STRING_AGG(i ORDER BY i) FROM (
SELECT DISTINCT i FROM UNNEST(SPLIT(grp)) i)) grp
FROM (
SELECT a.id, STRING_AGG(b.grp) grp FROM x3 a CROSS JOIN x3 b
WHERE EXISTS (SELECT y FROM UNNEST(SPLIT(b.grp)) y WHERE a.id = y)
GROUP BY a.id )
Step 2 - Check
#standardSQL
SELECT COUNT(DISTINCT grp) users FROM `sandbox.temp.groups`
Step 3
#standardSQL
WITH `yourTable` AS (select * from `sandbox.temp.id1_id2_pairs`
), x1 AS (SELECT id1, STRING_AGG(id2) id2s FROM `yourTable` GROUP BY id1
), x3 as (select * from `sandbox.temp.groups`
), f AS (SELECT DISTINCT grp FROM x3 ORDER BY grp
)
SELECT ROW_NUMBER() OVER() id, grp id1,
(SELECT STRING_AGG(i ORDER BY i) FROM (SELECT DISTINCT i FROM UNNEST(SPLIT(id2)) i)) id2
FROM (
SELECT grp, STRING_AGG(id2s) id2 FROM f
CROSS JOIN x1 WHERE EXISTS (SELECT y FROM UNNEST(SPLIT(f.grp)) y WHERE id1 = y)
GROUP BY grp)
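Automation:
Of course, the above "process" can be run manually if the iterations converge quickly, so you end up with 10-20 runs. In more real-life cases, though, you can easily automate this with any client of your choice.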
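A client-side driver for this loop could look like the Python sketch below. It is written against two hypothetical callables, `run_grouping` (runs the Step 2 grouping query with overwrite) and `count_groups` (runs the Step 2 check query and returns its number), so it does not depend on any particular BigQuery client. The stand-in functions simply replay the counts reported earlier in this answer, so the sketch runs without a BigQuery connection.

```python
from typing import Callable, Optional

def run_until_converged(
    run_grouping: Callable[[], None],
    count_groups: Callable[[], int],
    max_iterations: int = 100,
) -> int:
    """Repeat grouping until the group count stops changing; return iterations run."""
    previous: Optional[int] = None
    for iteration in range(1, max_iterations + 1):
        run_grouping()
        current = count_groups()
        if current == previous:
            return iteration
        previous = current
    raise RuntimeError(f"no convergence after {max_iterations} iterations")

# Stand-ins for the two queries: replay the group counts 5, 3, 3.
reported_counts = iter([5, 3, 3])
state = {"count": 0}

def fake_run_grouping() -> None:
    state["count"] = next(reported_counts)

def fake_count_groups() -> int:
    return state["count"]

iterations = run_until_converged(fake_run_grouping, fake_count_groups)
print(iterations)  # 3
```

The `max_iterations` cap is a safety net added for the sketch, so that a grouping that never stabilizes fails loudly instead of looping forever.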
Regarding mysql - finding unique users from linked values, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/47357176/