Graph database or relational database Common Table Expressions: comparing acyclic graph query performance

Question

Are graph databases more performant than relational databases for highly connected acyclic graph data?

I need to speed up my queries significantly, and I hope a graph database will be the answer. My relational database queries improved a great deal when I switched to Common Table Expressions (CTEs): a recursive search of my sample data went from 16 hours to 30 minutes. Still, 30 minutes is far too long for a web application, and working around that kind of response time with caching gets ridiculous pretty quickly.

My Gremlin query looks something like:

g.withSack(100D).
  V(with vertex id).
  repeat(out('edge_label').
         sack(div).by(constant(2D))).
    emit().
  group().by('node_property').by(sack().sum()).
  unfold().
  order().by(values,decr).
  fold()

A Cypher equivalent (thank you, cyberSam) would be something like:

MATCH p=(f:Foo)-[:edge_label*]->(g)
WHERE f.id = 123
RETURN g, SUM(100*0.5^(LENGTH(p)-1)) AS weight
ORDER BY weight DESC

and my SQL is roughly like:

WITH PctCTE(id, pt, tipe, ct)
AS
    (SELECT id, CONVERT(DECIMAL(28,25), 100.0) AS pt, kynd, 1
        FROM db.reckrd parent
        WHERE parent.id = @id
    UNION ALL
        SELECT child.id, CONVERT(DECIMAL(28,25), parent.pt/2.0), child.kynd, parent.ct+1
        FROM db.reckrd AS child
        INNER JOIN PctCTE AS parent
        ON (parent.tipe = 'M' AND
            (child.emm = parent.id))
        OR
           (NOT parent.tipe = 'M' AND
            (child.not_emm = parent.id))
    ),
    mergeCTE(dups, h, p)
    AS
        (SELECT ROW_NUMBER() OVER (PARTITION BY id ORDER BY ct) AS dups, id, SUM(pt) OVER (PARTITION BY id)
        FROM PctCTE
        )

which should return a result set with 500,000+ edges in my test instance.
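To make the recursive query above easier to experiment with, here is a minimal runnable sketch of the same idea. It is not the original query: it runs on SQLite through Python's `sqlite3` module rather than on SQL Server, the four toy rows are hypothetical, the `CONVERT(DECIMAL(28,25), ...)` casts are replaced by plain floating-point values, and the final `GROUP BY` is an assumed stand-in for the unfinished `mergeCTE` step.

```python
import sqlite3

# Hypothetical toy rows (id, kynd, emm, not_emm); not data from the question.
ROWS = [
    (1, "M", None, None),  # start row
    (2, "F", 1, None),     # reached from 1 through emm (1 is kind 'M')
    (3, "M", 1, None),     # reached from 1 through emm
    (4, "F", 3, 2),        # reached from 3 through emm and from 2 through not_emm
]

QUERY = """
WITH RECURSIVE PctCTE(id, pt, tipe, ct) AS (
    SELECT id, 100.0, kynd, 1
    FROM reckrd
    WHERE id = :start_id
  UNION ALL
    SELECT child.id, parent.pt / 2.0, child.kynd, parent.ct + 1
    FROM reckrd AS child
    JOIN PctCTE AS parent
      ON (parent.tipe = 'M' AND child.emm = parent.id)
      OR (parent.tipe <> 'M' AND child.not_emm = parent.id)
)
-- Assumed final step: one row per id with the summed weight and path count.
SELECT id, SUM(pt) AS weight, COUNT(*) AS paths
FROM PctCTE
GROUP BY id
ORDER BY weight DESC, id
"""


def weights_from(conn, start_id):
    """Run the recursive CTE from start_id and return (id, weight, paths) rows."""
    return conn.execute(QUERY, {"start_id": start_id}).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE reckrd (id INTEGER PRIMARY KEY, kynd TEXT, emm INTEGER, not_emm INTEGER)"
)
conn.executemany("INSERT INTO reckrd VALUES (?, ?, ?, ?)", ROWS)

result = weights_from(conn, 1)
for row in result:
    print(row)
```

In this toy data, row 4 is reached along two different paths, so the recursion produces two rows for it, each carrying a weight of 25. That is the sense in which the recursive part emits one row per path rather than one row per record, which is why the intermediate result can grow much larger than the table itself.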

Even if I filtered to reduce the size of the output, the filter could only be applied after traversing all of those edges first, because that is how I get to the interesting results I want to analyse.

I can foresee some queries on real data getting closer to having to traverse 3,000,000+ edges...

If graph databases aren't the answer, is a CTE as good as it gets?

Recommended answer

I tried JanusGraph-0.5.2 with BerkeleyDB Java Edition. My sample data set has 580,832 vertices and 2,325,896 edges, loaded from a roughly 1 GB GraphML file. The network has an average degree of 4, a diameter of 30, an average path length of 1124, a modularity of 0.7, an average clustering coefficient of 0.013 and an eigenvector centrality (100 iterations) of 4.5.

No doubt I am writing my query rather amateurishly, but after waiting 10 hours only to receive a Java stack out of memory error, it is clear that my CTE performs at least 20 times faster!
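As a quick sanity check on the figures quoted in this answer, the arithmetic can be reproduced in a few lines. This sketch assumes the reported average degree counts directed edges per vertex, and it treats the 10 hour wait as a lower bound, since the JanusGraph query never finished.

```python
# Figures quoted in the answer and the question.
vertices = 580_832
edges = 2_325_896

# Assumption: "average degree" here means directed edges per vertex.
avg_degree = edges / vertices

cte_minutes = 30              # CTE run time reported in the question
janusgraph_minutes = 10 * 60  # lower bound: the query failed after 10 hours

# The CTE is at least this many times faster than the failed attempt.
speedup_lower_bound = janusgraph_minutes / cte_minutes

print(f"average degree ~ {avg_degree:.2f}")
print(f"speed-up of CTE over JanusGraph attempt >= {speedup_lower_bound:.0f}x")
```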

My conf/janusgraph-berkeleyje.properties file included the following settings:

gremlin.graph = org.janusgraph.core.JanusGraphFactory
storage.backend = berkeleyje
storage.directory = ../db/berkeley
cache.db-cache = true
cache.db-cache-size = 0.5
cache.db-cache-time = 0
cache.tx-cache-size = 20000
cache.db-cache-clean-wait = 0
storage.transaction = false
storage.berkeleyje.cache-percentage = 65

At this stage in my investigation, it would appear that CTEs are at least an order of magnitude more performant than graph databases on heavily recursive queries. I would love to be wrong...
