问题描述
我正在构建一个非常大的字典,我正在执行许多检查,以查看一个键是否在结构中,然后添加如果它是唯一的或增加一个计数器,如果它是相同的。Python使用存储字典(不要与加密散列函数混淆)。 Lookups是O(1),但是如果哈希表已满,那么它必须被重新排列,这是非常昂贵的。
我的问题是,我最好使用或是一个散列表足够好?
唯一可以确定的方法是为了实现这两个并且检查,但是我知道的猜测是字典将更快,因为一个二叉搜索树的成本为O(log(n))用于查找和插入,我认为除了最不喜欢的情况作为大规模哈希冲突)哈希表的O(1)查找将超过偶尔的调整大小。
如果您查看,您会看到:
- 一个字典从8个条目开始(
PyDict_MINSIZE
); - a字数增加时,具有50,000个或更少的字符大小的四倍;
- 一个具有超过50,000个条目的字典,当它增长时,大小增加一倍;
- 键字节缓存在字典中,因此在字典调整大小时不会重新计算。
(也值得一读。)
所以如果你的字典有1,000,000个条目相信在大小调整期间,它将被重新调整大小十一次(8→32→128→512→2048→8192→32768→131072→262144→524288→1048576→2097152)。这似乎远远低于参与到AVL树中1,000,000次插入的所有重新平衡的费用。
I am build a very large dictionary and I am performing many checks to see if a key is in the structure and then adding if it unique or incrementing a counter if it is identical.
Python uses a hash data structure to store dictionaries (not to be confused with a cryptographic hash function). Lookups are O(1), but if the hash table is full the it has to be rehashed which is very expensive.
My Question is, would I be better off using a AVL Binary Search Tree or is a hash table good enough?
The only way to be sure would be to implement both and check, but my informed guess is that the dictionary will be faster, because a binary search tree has cost O(log(n)) for lookup and insertion, and I think that except under the most pessimal of situations (such as massive hash collisions) the hash table's O(1) lookup will outweigh the occasional resize.
If you take a look at the Python dictionary implementation, you'll see that:
- a dictionary starts out with 8 entries (
PyDict_MINSIZE
); - a dictionary with 50,000 or fewer entries quadruples in size when it grows;
- a dictionary with more than 50,000 entries doubles in size when it grows;
- key hashes are cached in the dictionary, so they are not recomputed when the dictionary is resized.
(The "NOTES ON OPTIMIZING DICTIONARIES" are worth reading too.)
So if your dictionary has 1,000,000 entries, I believe that it will be resized eleven times (8 → 32 → 128 → 512 → 2048 → 8192 → 32768 → 131072 → 262144 → 524288 → 1048576 → 2097152) at a cost of 2,009,768 extra insertions during the resizes. This seems likely to be much less than the cost of all the rebalancing involved in 1,000,000 insertions into an AVL tree.
这篇关于Python的基础哈希数据结构的字典的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!