问题描述
使用两个数据库来说明这个例子:CouchDB 和 卡桑德拉.
Using two databases to illustrate this example: CouchDB and Cassandra.
CouchDB 使用 B+ 树作为文档索引(使用巧妙的修改在他们的仅附加环境) - 更具体地说,当文档被修改(插入/更新/删除)时,它们被附加到正在运行的数据库文件以及一个完整的叶子 -> B+ 树中的所有节点的节点路径受更新版本影响就在文档之后.
CouchDB uses a B+ Tree for document indexes (using a clever modification to work in their append-only environment) - more specifically as documents are modified (insert/update/delete) they are appended to the running database file as well as a full Leaf -> Node path from the B+ tree of all the nodes effected by the updated revision right after the document.
这些零碎的索引修订与修改一起内联,这样完整索引是附加在文件末尾的最新索引修改的联合,以及数据文件中仍然相关的其他部分并且还没有修改.
These piece-mealed index revisions are inlined right alongside the modifications such that the full index is a union of the most recent index modifications appended at the end of the file along with additional pieces further back in the data file that are still relevant and haven't been modified yet.
搜索B+树是O(logn).
卡桑德拉
Cassandra 将记录键在内存中的表中排序(让我们将它们视为这个问题的数组)并将它们作为单独的(排序的)写出 排序字符串表 不时.
我们可以将所有这些表的集合视为索引"(据我所知).
We can think of the collection of all of these tables as the "index" (from what I understand).
Cassandra 需要压缩/组合这些排序字符串表不时地,创建一个更完整的索引文件表示.
Cassandra is required to compact/combine these sorted-string tables from time to time, creating a more complete file representation of the index.
搜索排序数组是 O(logn).
问题
假设在 CouchDB 中维护部分 B+ 树块与在 Cassandra 中维护部分排序字符串索引之间的复杂程度相似,并且鉴于两者都提供 O(logn) 搜索时间,您认为哪个可以更好地表示数据库索引以及为什么?
Question
Assuming a similar level of complexity between either maintaining partial B+ tree chunks in CouchDB versus partial sorted-string indices in Cassandra and given that both provide O(logn) search time which one do you think would make a better representation of a database index and why?
我特别好奇是否有一个实现细节使其特别具有吸引力,或者它们是否都是一种清洗,而您只需选择您喜欢使用的任何数据结构/对开发者来说更有意义.
I am specifically curious if there is an implementation detail about one over the other that makes it particularly attractive or if they are both a wash and you just pick whichever data structure you prefer to work with/makes more sense to the developer.
谢谢你的想法.
推荐答案
在比较 BTree 索引和 SSTable 索引时,应该考虑写入复杂度:
When comparing a BTree index to an SSTable index, you should consider the write complexity:
当随机写入写入时复制的 BTree 时,您将产生随机读取(以进行叶节点和路径的复制).因此,虽然写入在磁盘上是顺序的,但对于大于 RAM 的数据集,这些随机读取将很快成为瓶颈.对于类似 SSTable 的索引,写入时不会发生此类读取 - 只有顺序写入.
When writing randomly to a copy-on-write BTree, you will incur random reads (to do the copy of the leaf node and path). So while the writes my be sequential on disk, for datasets larger than RAM, these random reads will quickly become the bottle neck. For a SSTable-like index, no such read occurs on write - there will only be the sequential writes.
您还应该考虑在最坏的情况下,对 BTree 的每次更新都可能导致 log_b N 次 IO - 也就是说,您最终可能会为每个键写入 3 或 4 个块.如果密钥大小远小于块大小,这是非常昂贵的.对于类似SSTable的索引,每次写IO都会包含尽可能多的新键,所以每个键的IO成本更像是1/B.
You should also consider that in the worse case, every update to a BTree could incur log_b N IOs - that is, you could end up writing 3 or 4 blocks for every key. If key size is much less than block size, this is extremely expensive. For an SSTable-like index, each write IO will contain as many fresh keys as it can, so the IO cost for each key is more like 1/B.
在实践中,这使得类似 SSTable 的速度比 BTrees 快数千倍(对于随机写入).
In practice, this make SSTable-like thousands of times faster (for random writes) than BTrees.
在考虑实现细节时,我们发现实现类似 SSTable 的索引(几乎)无锁要容易得多,而 BTree 的锁策略变得相当复杂.
When considering implementation details, we have found it a lot easier to implement SSTable-like indexes (almost) lock-free, where as locking strategies for BTrees has become quite complicated.
您还应该重新考虑您的阅读成本.您是正确的,因为 BTree 是 O(log_b N) 随机点读取的随机 IO,但类似 SSTable 的索引实际上是 O(#sstables . log_b N).如果没有合适的合并方案,#sstables 与 N 成正比.有各种技巧可以解决这个问题(例如,使用布隆过滤器),但这些对小型随机范围查询没有帮助.这是我们在 Cassandra 中发现的:
You should also re-consider your read costs. You are correct than a BTree is O(log_b N) random IOs for random point reads, but a SSTable-like index is actually O(#sstables . log_b N). Without an decent merge scheme, #sstables is proportional to N. There are various tricks to get round this (using Bloom Filters, for instance), but these don't help with small, random range queries. This is what we found with Cassandra:
这就是为什么我们的 (GPL) 存储引擎 Castle 的合并方式略有不同,并且可以在写入性能(O(log^2 N/B)).在实践中,我们发现它也比 Cassandra 的 SSTable 写入索引更快.
This is why Castle, our (GPL) storage engine, does merges slightly differently, and can achieve a lot better (O(log^2 N)) range queries performance with a slight trade off in write performance (O(log^2 N / B)). In practice we find it to be quicker than Cassandra's SSTable index for writes as well.
如果你想了解更多,我已经讨论了它是如何工作的:
If you want to know more about this, I've given a talk about how it works:
这篇关于用于数据库索引的排序字符串表 (SSTable) 或 B+ 树?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!