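Imagine a graph database made up of URLs and the tags used to describe them. From it we want to find which sets of tags are most often used together, and determine which URLs belong to each identified set.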

I have tried to create a dataset in cypher that simplifies the problem:

CREATE (tech:Tag { name: "tech" }), (comp:Tag { name: "computers" }), (programming:Tag { name: "programming" }), (cat:Tag { name: "cats" }), (mice:Tag { name: "mice" }), (u1:Url { name: "http://u1.com" })-[:IS_ABOUT]->(tech), (u1)-[:IS_ABOUT]->(comp), (u1)-[:IS_ABOUT]->(mice), (u2:Url { name: "http://u2.com" })-[:IS_ABOUT]->(mice), (u2)-[:IS_ABOUT]->(cat), (u3:Url { name: "http://u3.com" })-[:IS_ABOUT]->(tech), (u3)-[:IS_ABOUT]->(programming), (u4:Url { name: "http://u4.com" })-[:IS_ABOUT]->(tech), (u4)-[:IS_ABOUT]->(mice), (u4)-[:IS_ABOUT]->(acc:Tag { name: "accessories" })

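Using this as a reference (see the neo4j console example), we can look at it and visually identify that the most commonly used tags are tech and mice (the query for this is trivial), each of which is referenced by 3 URLs. The most commonly used tag pair is [tech, mice], because it is (in this example) the only pair shared by 2 urls (u4 and u1). Note that this tag pair is only a subset of the tags on the matching URLs; it is not the entire tag set of either one. No combination of 3 tags is shared by more than one URL.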
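The figures quoted above can be double-checked outside the database. The following is a minimal Python sketch, assuming the sample data is mirrored by hand in a dictionary; `URL_TAGS` and `tag_group_counts` are names made up for this illustration and are not part of the dataset or of neo4j.

```python
from collections import Counter
from itertools import combinations

# Sample data copied by hand from the CREATE statement: URL -> set of tags.
URL_TAGS = {
    "http://u1.com": {"tech", "computers", "mice"},
    "http://u2.com": {"mice", "cats"},
    "http://u3.com": {"tech", "programming"},
    "http://u4.com": {"tech", "mice", "accessories"},
}


def tag_group_counts(url_tags, size):
    """Count how many URLs carry each group of `size` tags."""
    counts = Counter()
    for tags in url_tags.values():
        # Sorting first makes (mice, tech) and (tech, mice) the same key.
        for group in combinations(sorted(tags), size):
            counts[group] += 1
    return counts


singles = tag_group_counts(URL_TAGS, 1)
pairs = tag_group_counts(URL_TAGS, 2)
triples = tag_group_counts(URL_TAGS, 3)

print(singles[("tech",)], singles[("mice",)])  # URLs per single tag
print(pairs.most_common(1))                    # most frequent pair
print(max(triples.values()))                   # best count for any triple
```

On this data the two single tags each count 3 URLs, the pair (mice, tech) counts 2, and no triple counts more than 1, which matches the description above.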

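How can I write a cypher query to determine which tag combinations are most often used together (in pairs, or in groups of size N)? Is there perhaps a better way to structure this data that would make the analysis easier? Or is this problem simply not a good fit for a graph database? I have been struggling with this for a while, so any help or ideas would be greatly appreciated!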

Best answer

This looks like a combinatorics problem.

// The tags for each URL, sorted by ID
MATCH (U:Url)-[:IS_ABOUT]->(T:Tag)
WITH U, T ORDER BY id(T)
WITH U,
     collect(distinct T) as TAGS

// Calculate the number of tag combinations for a URL,
// independent of the order of the tags: 2^n for n tags.
// The ^ operator returns a float, hence the conversion
// (toInteger replaces the older toInt)
//
WITH U, TAGS,
     toInteger(2 ^ size(TAGS)) as numberOfCombinations

// Iterate through all non-empty combinations (bitmasks 1 .. 2^n - 1)
UNWIND RANGE(1, numberOfCombinations - 1) as combinationIndex
WITH U, TAGS, combinationIndex

// And check for each tag its presence in the combination.
// Bitwise operations are missing in cypher,
// therefore, we use APOC
// https://neo4j-contrib.github.io/neo4j-apoc-procedures/#_bitwise_operations
// (newer APOC releases may expose apoc.bitwise.op as a function
// instead of a procedure; if so, evaluate it inside the WITH clause)
//
UNWIND RANGE(0, size(TAGS)-1) as tagIndex
WITH U, TAGS, combinationIndex, tagIndex,
     toInteger(2 ^ tagIndex) as pw2
     call apoc.bitwise.op(combinationIndex, "&", pw2) YIELD value
WITH U, TAGS, combinationIndex, tagIndex,
     value WHERE value > 0

// Get all combinations of tags for the URL
WITH U, TAGS, combinationIndex,
     collect(TAGS[tagIndex]) as combination

// Return all the possible combinations of tags, sorted by frequency of use
RETURN combination, count(combination) as freq, collect(U) as urls
       ORDER BY freq DESC
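The core of the query is the bitmask enumeration: each integer from 1 to 2^n - 1 selects one non-empty subset of a URL's n tags. As a rough illustration of just that step, here is a small Python sketch; `nonempty_subsets` is a hypothetical helper written for this example, and the tag list is u1's tags from the sample data.

```python
def nonempty_subsets(tags):
    """Yield every non-empty subset of `tags`, one per bitmask 1 .. 2**n - 1."""
    n = len(tags)
    for mask in range(1, 2 ** n):
        # Tag i belongs to the subset when bit i of the mask is set,
        # the same test the query performs with the bitwise "&".
        yield [tags[i] for i in range(n) if mask & (1 << i)]


subsets = list(nonempty_subsets(["computers", "mice", "tech"]))
for subset in subsets:
    print(subset)
```

Three tags give 7 subsets. The count doubles with every extra tag, so a URL with 20 tags would already produce 1,048,575 combinations; limiting the combination size is probably worth considering on real data.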

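I think it is best to use this algorithm to compute and store the tag combinations at the moment a URL is tagged. The query would then look something like this: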
MATCH (Comb:TagsCombination)<-[:IS_ABOUT]-(U:Url)
WITH Comb, collect(U) as urls, count(U) as freq
MATCH (Comb)-[:CONTAIN]->(T:Tag)
RETURN Comb, collect(T) as Tags, urls, freq ORDER BY freq DESC
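To make the "compute at tagging time" idea more concrete, here is a hedged Python sketch of the bookkeeping involved. It is an in-memory stand-in only: `TagCombinationIndex`, `add_tag`, `most_common` and the `max_size` cap are all assumptions made for this illustration, and a real implementation would create the TagsCombination nodes and relationships in the database instead.

```python
from collections import defaultdict
from itertools import combinations


class TagCombinationIndex:
    """In-memory stand-in for TagsCombination nodes: combination -> URLs."""

    def __init__(self, max_size=3):
        self.max_size = max_size  # assumed cap, to limit the 2**n growth
        self.tags_by_url = defaultdict(set)
        self.urls_by_combination = defaultdict(set)

    def add_tag(self, url, tag):
        """Tag a URL and register only the combinations containing the new tag."""
        existing = self.tags_by_url[url]
        if tag in existing:
            return
        # Pair the new tag with 0 .. max_size - 1 of the URL's existing tags.
        for extra in range(self.max_size):
            for others in combinations(sorted(existing), extra):
                key = tuple(sorted(others + (tag,)))
                self.urls_by_combination[key].add(url)
        existing.add(tag)

    def most_common(self, size):
        """Combinations of the given size, most frequently used first."""
        ranked = [(combo, len(urls))
                  for combo, urls in self.urls_by_combination.items()
                  if len(combo) == size]
        return sorted(ranked, key=lambda item: (-item[1], item[0]))


index = TagCombinationIndex()
for url, tag in [("http://u1.com", "tech"), ("http://u1.com", "computers"),
                 ("http://u1.com", "mice"), ("http://u4.com", "tech"),
                 ("http://u4.com", "mice"), ("http://u4.com", "accessories")]:
    index.add_tag(url, tag)

print(index.most_common(2)[0])
```

Because each new tag only adds the combinations that include it, the work per tagging operation stays small and the read side reduces to a lookup sorted by frequency, which is the trade-off the stored-combination query relies on.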

On the topic of neo4j and finding the most commonly used sets of distinct terms, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/39518602/
