Problem description
Are graph databases more performant than relational databases for highly connected acyclic graph data?
I need to significantly speed up my queries and hope that a graph database will be the answer. My relational database queries improved significantly when I switched to Common Table Expressions (CTEs), which brought a recursive search of my sample data down from 16 hours to 30 minutes. Still, 30 minutes is far too long for a web application, and working around that kind of response time with caching gets ridiculous pretty quickly.
My Gremlin query looks something like:
    g.withSack(100D).
      V(with vertex id).
      repeat(out('edge_label').
             sack(div).by(constant(2D))).
        emit().
      group().by('node_property').by(sack().sum()).
      unfold().
      order().by(values, decr).
      fold()
A Cypher equivalent (thank you, cyberSam) would be something like:
MATCH p=(f:Foo)-[:edge_label*]->(g)
WHERE f.id = 123
RETURN g, SUM(100*0.5^(LENGTH(p)-1)) AS weight
ORDER BY weight DESC
and my SQL roughly like:
    WITH PctCTE(id, pt, tipe, ct)
    AS
    (SELECT id, CONVERT(DECIMAL(28,25), 100.0) AS pt, kynd, 1
       FROM db.reckrd parent
      WHERE parent.id = @id
     UNION ALL
     SELECT child.id, CONVERT(DECIMAL(28,25), parent.pt / 2.0), child.kynd, parent.ct + 1
       FROM db.reckrd AS child
      INNER JOIN PctCTE AS parent
         ON (parent.tipe = 'M' AND
             (child.emm = parent.id))
            OR
            (NOT parent.tipe = 'M' AND
             (child.not_emm = parent.id))
    ),
    mergeCTE(dups, h, p)
    AS
    (SELECT ROW_NUMBER() OVER (PARTITION BY id ORDER BY ct) AS dups, id, SUM(pt) OVER (PARTITION BY id)
       FROM PctCTE
    )
which should return a result set with 500,000+ edges in my test instance.
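To make the recursive CTE concrete, here is a minimal sketch that runs the same halving-per-level idea against an in-memory SQLite database from Python. Everything in it is an assumption for illustration: the `edge` table, its `parent_id`/`child_id` columns, the five-edge sample graph, and the `weights_from` helper are hypothetical, and SQLite's `WITH RECURSIVE` syntax differs from the T-SQL above.

```python
import sqlite3

# Hypothetical sample data: one (parent_id, child_id) row per edge of a small DAG.
EDGES = [(1, 2), (1, 3), (2, 4), (3, 4), (4, 5)]

# The start node gets 100; every step down halves the weight, as in PctCTE.
QUERY = """
WITH RECURSIVE walk(id, pt, depth) AS (
    SELECT ?, 100.0, 1
    UNION ALL
    SELECT e.child_id, walk.pt / 2.0, walk.depth + 1
    FROM edge AS e
    JOIN walk ON e.parent_id = walk.id
)
SELECT id, SUM(pt) AS weight, MIN(depth) AS first_seen
FROM walk
GROUP BY id
ORDER BY weight DESC, id
"""


def weights_from(start_id, edges=EDGES):
    """Return (id, summed weight, shallowest depth) for every node reached from start_id."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE edge (parent_id INTEGER, child_id INTEGER)")
    # The recursive step looks edges up by parent, so index that column.
    conn.execute("CREATE INDEX edge_parent ON edge (parent_id)")
    conn.executemany("INSERT INTO edge VALUES (?, ?)", edges)
    rows = conn.execute(QUERY, (start_id,)).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in weights_from(1):
        print(row)
```

In this sample, node 4 is reachable along two paths, so it and everything below it appear in `walk` once per path. The CTE produces one row per path rather than one per edge, which is worth keeping in mind when estimating how the query scales.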
Even if I filter to reduce the size of the output, the filter can only be applied after all of those edges have been traversed, because that traversal is how I reach the interesting results I want to analyse.
I can foresee some queries on real data getting closer to having to traverse 3,000,000+ edges ...
If graph databases aren't the answer, is a CTE as good as it gets?
Recommended answer
I tried JanusGraph-0.5.2 with BerkeleyDB Java Edition. My sample data set has 580832 vertices and 2325896 edges, loaded from a roughly 1 GB GraphML file. The network has an average degree of 4, a diameter of 30, an average path length of 1124, a modularity of 0.7, an average clustering coefficient of 0.013 and an eigenvector centrality (100 iterations) of 4.5.
No doubt I am writing my query rather amateurishly, but after waiting 10 hours only to receive a Java stack out of memory error, it is clear that my CTE is at least 20 times faster!
My conf/janusgraph-berkeleyje.properties file included the following settings:
gremlin.graph = org.janusgraph.core.JanusGraphFactory
storage.backend = berkeleyje
storage.directory = ../db/berkeley
cache.db-cache = true
cache.db-cache-size = 0.5
cache.db-cache-time = 0
cache.tx-cache-size = 20000
cache.db-cache-clean-wait = 0
storage.transaction = false
storage.berkeleyje.cache-percentage = 65
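One thing that may be worth checking in these settings is how much of the JVM heap the two caches claim together. The sketch below assumes, based on a reading of the JanusGraph configuration reference, that `cache.db-cache-size` values between 0 and 1 are a fraction of the heap and that `storage.berkeleyje.cache-percentage` is a percentage of the heap. The helper names `parse_properties` and `heap_fraction_reserved` are hypothetical, and this is only a rough sanity check, not a diagnosis of the out of memory error.

```python
def parse_properties(text):
    """Parse simple 'key = value' lines, skipping blanks and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def heap_fraction_reserved(props):
    """Sum the heap fractions requested by the two caches (assumed semantics)."""
    db_cache = float(props.get("cache.db-cache-size", 0))
    # Assumption: values above 1 are absolute byte sizes, so they are ignored here.
    db_enabled = props.get("cache.db-cache") == "true"
    db_fraction = db_cache if db_enabled and db_cache <= 1 else 0.0
    bdb_fraction = float(props.get("storage.berkeleyje.cache-percentage", 0)) / 100.0
    return db_fraction + bdb_fraction


SETTINGS = """
cache.db-cache = true
cache.db-cache-size = 0.5
storage.berkeleyje.cache-percentage = 65
"""

if __name__ == "__main__":
    fraction = heap_fraction_reserved(parse_properties(SETTINGS))
    print(f"caches request about {fraction:.0%} of the heap")
```

If those assumptions hold, the two caches together ask for more than the whole heap, which leaves nothing for the traversal itself. Lowering one of them, or raising the JVM heap size, might be worth trying before drawing conclusions about graph databases in general.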
At this stage in my investigation, it would appear that CTEs are at least an order of magnitude more performant than graph databases on heavily recursive queries. I would love to be wrong...
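For reference, the weighting that the Gremlin query in the question appears to compute (start at 100, halve on every hop, sum what arrives at each node) can be sketched in plain Python. This is an illustration under stated assumptions, not a benchmark: the graph is assumed to be acyclic and to fit in memory as a dict of child lists, totals are grouped per node rather than by `node_property`, and both function names are hypothetical. The first function enumerates every path, as the recursive queries do. The second pushes weights in topological order and touches each reachable edge once.

```python
from collections import defaultdict, deque


def weights_by_paths(children, start, initial=100.0):
    """Walk every path from start, halving the weight on each hop."""
    totals = defaultdict(float)
    stack = [(start, initial)]
    while stack:
        node, weight = stack.pop()
        for child in children.get(node, ()):
            totals[child] += weight / 2.0
            stack.append((child, weight / 2.0))
    return dict(totals)


def weights_by_propagation(children, start, initial=100.0):
    """Same totals on a DAG, computed by pushing weights in topological order."""
    # Count in-degrees inside the subgraph reachable from start.
    indegree = defaultdict(int)
    reachable, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for child in children.get(node, ()):
            indegree[child] += 1
            if child not in reachable:
                reachable.add(child)
                queue.append(child)
    # Kahn-style pass: a node forwards its weight only once all of it has arrived.
    totals = defaultdict(float)
    totals[start] = initial
    ready = deque([start])
    while ready:
        node = ready.popleft()
        for child in children.get(node, ()):
            totals[child] += totals[node] / 2.0
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)
    del totals[start]  # the start vertex itself is not emitted by the traversal
    return dict(totals)


if __name__ == "__main__":
    dag = {1: [2, 3], 2: [4], 3: [4], 4: [5]}
    print(weights_by_paths(dag, 1))
    print(weights_by_propagation(dag, 1))
```

Both functions return the same totals on the sample graph. The difference is that the path-based version does work proportional to the number of paths, which can grow far faster than the number of edges in a highly connected DAG. Whether the second approach helps on the real data is untested here.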