Question
I've done a little research on hash tables, and I keep running across the rule of thumb that when there are a certain number of entries (either max or via a load factor like 75%) the hash table should be expanded.
Almost always, the recommendation is to double (or double plus 1, i.e., 2n+1) the size of the hash table. However, I haven't been able to find a good reason for this.
Why double the size, rather than, say, increasing it by 25%, or increasing it to the size of the next prime number, or the next k-th prime (e.g., k = 3)?
I already know that it's often a good idea to choose an initial hash table size which is a prime number, at least if your hash function uses modulus such as universal hashing. And I know that's why it's usually recommended to do 2n+1 instead of 2n (e.g., http://www.concentric.net/~Ttwang/tech/hashsize.htm)
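For instance, here is a quick, purely illustrative check of that modulus point (the sizes are arbitrary): keys that share a factor with the table size pile into a fraction of the buckets, which a nearby prime-sized table avoids.

    # Illustrative only: keys sharing a factor with the table size cluster badly
    # under modulus hashing, while a nearby prime spreads them across all buckets.
    keys = range(0, 4000, 4)            # every key is a multiple of 4
    for capacity in (64, 67):           # a power of two vs. a nearby prime
        buckets_used = len({k % capacity for k in keys})
        print(capacity, buckets_used)   # 64 -> 16 buckets used, 67 -> all 67 used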
However, as I said, I haven't seen any real explanation for why doubling or doubling-plus-one is actually a good choice, rather than some other method of choosing a size for the new hash table.
(And yes I've read the Wikipedia article on hash tables :) http://en.wikipedia.org/wiki/Hash_table
Answer
Hash-tables could not claim "amortized constant time insertion" if, for instance, the resizing was by a constant increment. In that case the cost of resizing (which grows with the size of the hash-table) would make the cost of one insertion linear in the total number of elements to insert. Because resizing becomes more and more expensive with the size of the table, it has to happen "less and less often" to keep the amortized cost of insertion constant.
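To make that concrete, here is a small sketch (the starting capacity, the increment of 1024, and the names are arbitrary choices) that only counts how many element copies resizing forces while inserting n items under the two growth policies:

    def copies_when_growing(n, grow):
        # Count the element copies caused by resizing while inserting n items.
        capacity, size, copies = 8, 0, 0
        for _ in range(n):
            if size == capacity:          # table full: rehash every element into a bigger table
                copies += size
                capacity = grow(capacity)
            size += 1
        return copies

    n = 100_000
    print(copies_when_growing(n, lambda c: 2 * c) / n)     # doubling: stays around 1-2 copies per insertion
    print(copies_when_growing(n, lambda c: c + 1024) / n)  # fixed increment: ~49 here, and it keeps growing with n

The doubling figure stays bounded no matter how large n gets, while the fixed-increment figure grows linearly with n; that is precisely the difference between O(1) and O(n) amortized insertion cost.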
Most implementations allow the average bucket occupation to grow until a bound fixed in advance before resizing (anywhere between 0.5 and 3, which are all acceptable values). With this convention, just after resizing the average bucket occupation becomes half that bound. Resizing by doubling keeps the average bucket occupation within a band of width ×2.
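As a sketch of that policy (the bound of 0.75, the initial capacity, and the names are arbitrary choices for illustration, not a prescription), a toy separate-chaining table that doubles whenever the load factor would exceed its bound could look like this:

    class ToyHashTable:
        LOAD_FACTOR_BOUND = 0.75              # any fixed bound between 0.5 and 3 is workable

        def __init__(self, capacity=8):
            self.buckets = [[] for _ in range(capacity)]
            self.size = 0

        def put(self, key, value):
            bucket = self.buckets[hash(key) % len(self.buckets)]
            for entry in bucket:
                if entry[0] == key:           # existing key: overwrite in place
                    entry[1] = value
                    return
            if (self.size + 1) / len(self.buckets) > self.LOAD_FACTOR_BOUND:
                self._resize(2 * len(self.buckets))   # double the capacity
            self.buckets[hash(key) % len(self.buckets)].append([key, value])
            self.size += 1

        def get(self, key):
            for k, v in self.buckets[hash(key) % len(self.buckets)]:
                if k == key:
                    return v
            raise KeyError(key)

        def _resize(self, new_capacity):
            entries = [e for bucket in self.buckets for e in bucket]
            self.buckets = [[] for _ in range(new_capacity)]
            for k, v in entries:              # every existing entry is rehashed once
                self.buckets[hash(k) % new_capacity].append([k, v])

Because the capacity doubles, the load factor right after a resize is roughly half the bound, so it oscillates inside the ×2 band described above.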
Sub-note: because of statistical clustering, you have to take an average bucket occupation as low as 0.5 if you want many buckets to have at most one element (maximum lookup speed, ignoring the complex effects of cache size), or as high as 3 if you want a minimum number of empty buckets (which correspond to wasted space).
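A quick back-of-the-envelope check of those two numbers, under the usual uniform-hashing assumption (bucket occupancies are then approximately Poisson-distributed with mean equal to the load factor; real hash functions only approximate this):

    from math import exp

    for load in (0.5, 3.0):
        p_empty = exp(-load)                     # expected fraction of empty buckets
        p_chain = 1 - exp(-load) * (1 + load)    # expected fraction of buckets with 2+ elements
        print(f"load {load}: {p_empty:.0%} empty, {p_chain:.0%} holding collisions")

At 0.5, collisions are rare (about 9% of buckets) but roughly 61% of the buckets sit empty; at 3, only about 5% of the space is wasted but around 80% of the buckets hold chains.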