问题描述
好的概述
一般来说,您要在快速读取时间(例如嵌套集)或快速写入时间(邻接列表)之间做出决定.通常,您最终会得到以下最适合您需求的选项组合.以下提供了一些深入阅读:
Generally speaking, you're making a decision between fast read times (for example, nested set) or fast write times (adjacency list). Usually, you end up with a combination of the options below that best fit your needs. The following provides some in-depth reading:
- 一个嵌套间隔与邻接列表的比较:我发现的邻接列表、物化路径、嵌套集和嵌套间隔的最佳比较.
- 分层数据模型:幻灯片很好地解释了权衡和示例用法
- 在 MySQL 中表示层次结构:非常好的嵌套集概述特别是
- RDBMS 中的分层数据:最全面且组织良好的数据集我看过的链接,但没有太多解释
- One more Nested Intervals vs. Adjacency List comparison: the best comparison of Adjacency List, Materialized Path, Nested Set, and Nested Interval I've found.
- Models for hierarchical data: slides with good explanations of tradeoffs and example usage
- Representing hierarchies in MySQL: very good overview of Nested Set in particular
- Hierarchical data in RDBMSs: a most comprehensive and well-organized set of links I've seen, but not much in the way of explanation
选项
我知道的和一般特征:
- 邻接列表:
- 列:ID、ParentID
- 易于实施.
- 便宜的节点移动、插入和删除.
- 寻找级别、血统和血统的成本很高后代,路径
- 在支持 N+1 的数据库中通过通用表表达式避免使用 N+1
- Columns: ID, ParentID
- Easy to implement.
- Cheap node moves, inserts, and deletes.
- Expensive to find the level, ancestry & descendants, path
- Avoid N+1 via Common Table Expressions in databases that support them
- 列:左、右
- 廉价的血统、后代
- 非常昂贵的
O(n/2)
由于易失性编码而导致的移动、插入、删除 - Columns: Left, Right
- Cheap ancestry, descendants
- Very expensive
O(n/2)
moves, inserts, deletes due to volatile encoding - 使用带有祖先、后代、深度(可选)的单独连接表
- 廉价的祖先和后代
- 为插入、更新、删除写入成本
O(log n)
(子树的大小) - 规范化编码:有利于 RDBMS 统计和连接中的查询规划器
- 每个节点需要多行
- Lineage Column (a.k.a. Materialized Path, Path Enumeration)
- 列:血统(例如/parent/child/grandchild/etc...)
- 通过前缀查询的廉价后代(例如
LEFT(lineage, #) = '/enumerated/path'
) - 为插入、更新、删除写入成本
O(log n)
(子树的大小) - 非关系:依赖于数组数据类型或序列化字符串格式
- 类似于嵌套集,但使用实数/浮点数/十进制数,因此编码不会不稳定(移动/插入/删除便宜)
- 存在实数/浮点数/十进制表示/精度问题
- 矩阵编码变体为"添加祖先编码(物化路径)免费",但增加了线性代数的技巧.
- Like nested set, but with real/float/decimal so that the encoding isn't volatile (inexpensive move/insert/delete)
- Has real/float/decimal representation/precision issues
- Matrix encoding variant adds ancestor encoding (materialized path) for "free", but with the added trickiness of linear algebra.
- 修改后的邻接列表,为每条记录添加了级别和排名(例如排序)列.
- 迭代/分页便宜
- 昂贵的移动和删除
- 好的用途:线程讨论 - 论坛/博客评论
- Columns:每个血统级别一个,指的是所有父级直到根,从项目级别向下的级别设置为NULL
- 廉价的祖先、后代、级别
- 简单的插入、删除、移动叶子
- 昂贵的插入、删除、移动内部节点
- 对层次结构的深度有严格限制
- 使用CONNECT BY遍历邻接表
- ltree 数据类型,用于物化路径
- ltree datatype for Materialized Path
- 一般摘要
- 2008 提供了 HierarchyId 数据类型,似乎有助于谱系列接近并扩展可以表示的深度.
- General summary
- 2008 offers HierarchyId data type that appears to help with the Lineage Column approach and expand the depth that can be represented.
数据库特定说明
MySQL
甲骨文
PostgreSQL
SQL Server
推荐答案
我最喜欢的答案是该主题中第一句话所建议的.使用 Adjacency List 维护层次结构,使用 Nested Sets 查询层次结构.
My favorite answer is as what the first sentence in this thread suggested. Use an Adjacency List to maintain the hierarchy and use Nested Sets to query the hierarchy.
到目前为止的问题是从邻接列表到嵌套集的覆盖方法慢得可怕,因为大多数人使用称为推送堆栈"的极端 RBAR 方法进行转换,并且被认为是通过邻接表和嵌套集的惊人性能达到维护简单的涅磐的昂贵方式.结果,大多数人最终不得不满足于其中之一,尤其是当节点数量超过 100,000 个左右时.使用推送堆栈方法可能需要一整天的时间才能对 MLM 人员认为的小型百万节点层次结构进行转换.
The problem up until now has been that the coversion method from an Adjacecy List to Nested Sets has been frightfully slow because most people use the extreme RBAR method known as a "Push Stack" to do the conversion and has been considered to be way to expensive to reach the Nirvana of the simplicity of maintenance by the Adjacency List and the awesome performance of Nested Sets. As a result, most people end up having to settle for one or the other especially if there are more than, say, a lousy 100,000 nodes or so. Using the push stack method can take a whole day to do the conversion on what MLM'ers would consider to be a small million node hierarchy.
我想我会想出一种方法来以似乎不可能的速度将邻接列表转换为嵌套集,从而给 Celko 带来一些竞争.这是我的 i5 笔记本电脑上推送堆栈方法的性能.
I thought I'd give Celko a bit of competition by coming up with a method to convert an Adjacency List to Nested sets at speeds that just seem impossible. Here's the performance of the push stack method on my i5 laptop.
Duration for 1,000 Nodes = 00:00:00:870
Duration for 10,000 Nodes = 00:01:01:783 (70 times slower instead of just 10)
Duration for 100,000 Nodes = 00:49:59:730 (3,446 times slower instead of just 100)
Duration for 1,000,000 Nodes = 'Didn't even try this'
这是新方法的持续时间(括号中是压栈方法).
And here's the duration for the new method (with the push stack method in parenthesis).
Duration for 1,000 Nodes = 00:00:00:053 (compared to 00:00:00:870)
Duration for 10,000 Nodes = 00:00:00:323 (compared to 00:01:01:783)
Duration for 100,000 Nodes = 00:00:03:867 (compared to 00:49:59:730)
Duration for 1,000,000 Nodes = 00:00:54:283 (compared to something like 2 days!!!)
是的,没错.不到一分钟转换100万个节点,不到4秒转换10万个节点.
Yes, that's correct. 1 million nodes converted in less than a minute and 100,000 nodes in under 4 seconds.
您可以在以下 URL 中阅读有关新方法的信息并获取代码副本.http://www.sqlservercentral.com/articles/Hierarchy/94040/
You can read about the new method and get a copy of the code at the following URL.http://www.sqlservercentral.com/articles/Hierarchy/94040/
我还使用类似的方法开发了一个预聚合"层次结构.传销人员和制作物料清单的人将对本文特别感兴趣.http://www.sqlservercentral.com/articles/T-SQL/94570/
I also developed a "pre-aggregated" hierarchy using similar methods. MLM'ers and people making bills of materials will be particularly interested in this article.http://www.sqlservercentral.com/articles/T-SQL/94570/
如果您想停下来看看这两篇文章,请跳到加入讨论"链接,让我知道您的想法.
If you do stop by to take a look at either article, jump into the "Join the discussion" link and let me know what you think.
这篇关于在关系数据库中存储分层数据有哪些选项?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!