本文介绍了gremlin语法以计算Jaccard相似度的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有兴趣为图中未直接连接的所有顶点对计算Jaccard相似性度量.雅卡德度量标准定义为两个顶点的邻居的交点的范数除以相同集合的并集范数.

I'm interested in calculating the Jaccard similarity metric for all pairs of vertices in a graph that are not directly connected. The Jaccard metric is defined as the norm of the intersection of the neighbors of the two vertices divided by the norm of the union of the same sets.

其中

到目前为止,我已经能够获得所有未直接连接的节点对(仅对这种情况感兴趣,以便进行链接预测,如果已经存在直接链接,那么我就不需要计算Jaccard度量标准),因此对于一对(x,y)其中x不等于y:

So far I have been able to get all pairs of nodes not directly connected (only interested in this case for link prediction, if a direct link already exists then I do not need to calculate the Jaccard metric) such that for a pair (x, y) where x not equals y:

g.V().as('v1').V().where(neq('v1')).as('v2').filter(__.not(inE().where(outV().as('v1'))))

此外,我还为标记为v1out和v2out的每个配对成员包括邻居:

In addition to that I also include the neighbors for each pair member labeled v1out and v2out:

g.V().as('v1').out().as('v1out').V().where(neq('v1')).as('v2').filter(__.not(inE().where(outV().as('v1')))).out().as('v2out')

在这里,我将如何执行集合操作以获取两个相邻集合的交集和并集中的元素数?在那之后,我相信我可以追加数学步骤(当前使用TinkerPop 3.4.0)来计算Jaccard相似率,然后是选择步骤以在值大于阈值时添加边缘.如果完全不同的方法比上面的部分解决方案有好处,我很乐意采用它并最终使它起作用.

From here how would I perform the set operations to get the number of elements in the intersection and union of the two neighbor sets? After that I believe I can append the math step (currently using TinkerPop 3.4.0) to calculate the Jaccard similarity ratio followed by a choose step to add an edge when the value is greater than a threshold. If a completely different approach has benefits over the partial solution above I would be happy to adopt it and finally get this working.

推荐答案

让我们逐步进行操作:

找到顶点对,并收集它们各自的邻居:

Find pairs of vertices and also collect their respective neighbors:

g.V().match(
      __.as('v1').out().dedup().fold().as('v1n'),
      __.as('v1').V().as('v2'),
      __.as('v2').out().dedup().fold().as('v2n')).
    where('v1', neq('v2'))

确保v1不是v2的邻居,反之亦然:

Make sure that v1 is not a neighbor of v2 and vice versa:

g.V().match(
      __.as('v1').out().dedup().fold().as('v1n'),
      __.as('v1').V().as('v2'),
      __.as('v2').out().dedup().fold().as('v2n')).
    where('v1', neq('v2').and(without('v2n'))).
    where('v2', without('v1n'))

接下来,计算相交的邻居数和邻居总数:

Next, compute the number of intersecting neighbors and the total number of neighbors:

g.V().match(
      __.as('v1').out().dedup().fold().as('v1n'),
      __.as('v1').V().as('v2'),
      __.as('v2').out().dedup().fold().as('v2n')).
    where('v1', neq('v2').and(without('v2n'))).
    where('v2', without('v1n')).as('m').
  project('v1','v2','i','u').
    by(select('v1')).
    by(select('v2')).
    by(select('v1n').as('n').
       select('m').
       select('v2n').unfold().
         where(within('n')).
       count()).
    by(union(select('v1n'),
             select('v2n')).unfold().
       dedup().count())

最后,通过将i除以u来计算Jaccard相似度(还要确保滤除没有邻居的顶点以防止被0除):

And finally, compute the Jaccard similarity by dividing i by u (also make sure that vertices without neighbors get filtered out to prevent divisions by 0):

g.V().match(
      __.as('v1').out().dedup().fold().as('v1n'),
      __.as('v1').V().as('v2'),
      __.as('v2').out().dedup().fold().as('v2n')).
    where('v1', neq('v2').and(without('v2n'))).
    where('v2', without('v1n')).as('m').
  project('v1','v2','i','u').
    by(select('v1')).
    by(select('v2')).
    by(select('v1n').as('n').
       select('m').
       select('v2n').unfold().
         where(within('n')).
       count()).
    by(union(select('v1n'),
             select('v2n')).unfold().
       dedup().count()).
  filter(select('u').is(gt(0))).
  project('v1','v2','j').
    by(select('v1')).
    by(select('v2')).
    by(math('i/u'))

最后一件事:由于比较顶点v1v2与比较v2v1相同,因此查询只需要考虑一种情况.一种方法是确保v1的ID小于v2的ID:

One last thing: Since comparing vertex v1 and v2 is the same as comparing v2 and v1, the query only needs to consider one case. One way to do that is by making sure that v1's id is smaller than v2's id:

g.V().match(
      __.as('v1').out().dedup().fold().as('v1n'),
      __.as('v1').V().as('v2'),
      __.as('v2').out().dedup().fold().as('v2n')).
    where('v1', lt('v2')).
      by(id).
    where('v1', without('v2n')).
    where('v2', without('v1n')).as('m').
  project('v1','v2','i','u').
    by(select('v1')).
    by(select('v2')).
    by(select('v1n').as('n').
       select('m').
       select('v2n').unfold().
         where(within('n')).
       count()).
    by(union(select('v1n'),
             select('v2n')).unfold().
       dedup().count()).
  filter(select('u').is(gt(0))).
  project('v1','v2','j').
    by(select('v1')).
    by(select('v2')).
    by(math('i/u'))

现代玩具图上执行遍历结果如下:

Executing this traversal over the modern toy graph yields the following result:

gremlin> g = TinkerFactory.createModern().traversal()
==>graphtraversalsource[tinkergraph[vertices:6 edges:6], standard]
gremlin> g.V().match(
......1>       __.as('v1').out().dedup().fold().as('v1n'),
......2>       __.as('v1').V().as('v2'),
......3>       __.as('v2').out().dedup().fold().as('v2n')).
......4>     where('v1', lt('v2')).
......5>       by(id).
......6>     where('v1', without('v2n')).
......7>     where('v2', without('v1n')).as('m').
......8>   project('v1','v2','i','u').
......9>     by(select('v1')).
.....10>     by(select('v2')).
.....11>     by(select('v1n').as('n').
.....12>        select('m').
.....13>        select('v2n').unfold().
.....14>          where(within('n')).
.....15>        count()).
.....16>     by(union(select('v1n'),
.....17>              select('v2n')).unfold().
.....18>        dedup().count()).
.....19>   filter(select('u').is(gt(0))).
.....20>   project('v1','v2','j').
.....21>     by(select('v1')).
.....22>     by(select('v2')).
.....23>     by(math('i/u'))
==>[v1:v[1],v2:v[5],j:0.0]
==>[v1:v[1],v2:v[6],j:0.3333333333333333]
==>[v1:v[2],v2:v[4],j:0.0]
==>[v1:v[2],v2:v[6],j:0.0]
==>[v1:v[4],v2:v[6],j:0.5]
==>[v1:v[5],v2:v[6],j:0.0]

这篇关于gremlin语法以计算Jaccard相似度的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!

07-31 06:03