问题描述
我想澄清一下SAS哈希表中存储桶的定义.问题恰恰与 hashexp 参数有关.
I would like to have a little clarification on the definiton of a bucket in SAS hashtable. The question is exactly about the hashexp parameter.
根据SAS DOC, hashexp 为:
According to the SAS DOCs, hashexp is:
哈希对象的内部表大小,其中哈希表的大小为2n.
HASHEXP的值用作创建哈希表大小的2的幂.例如,HASHEXP的值为4等于哈希表大小为24或16.HASHEXP的最大值为20.
哈希表的大小不等于可以存储的项目数.将哈希表想象成存储桶"的数组.哈希表大小为16时将具有16个存储桶".每个存储桶可以容纳无限数量的物品.哈希表的效率在于哈希函数将项目映射到存储桶并从存储桶中检索项目的能力.
您应相对于哈希对象中的数据量设置哈希表大小,以最大程度地提高哈希对象查找例程的效率.尝试使用不同的HASHEXP值,直到获得最佳结果.例如,如果哈希对象包含一百万个项目,则哈希表大小为16(HASHEXP = 4)会起作用,但效率不高.哈希表大小为512或1024(HASHEXP = 9或10)将导致最佳性能.
问题是什么?哈希表大小是什么,而哈希对象中的数据量却不大?
The question is what exactly is a hash table size, while it is not a amount of data in the hash object?
应该理解为好像我们想要分配尽可能多的内存,但不要少于或更少.使事情快速运行是二的力量.但这并没有限制可能使用的数据量,它仅指示将要使用多少数据,对吗?
Should it be understood as if we wanted to allocate as much memory as it may be neccessary but not less, no more. It is a power of two to get things work fast. But it does not limit the amount of data possibly used, it only indicates about how much is going to be used, right?
推荐答案
Paul Dorfman(哈希大师)在本白皮书的第10页上做了相当详细的说明:
Paul Dorfman (the master of hashing) goes into a fair bit of detail on page 10 of this whitepaper:
http://www2.sas.com/proceedings/forum2008/037-2008.pdf
据我了解,哈希表将其数据存储在二进制树中.hashexp创建的每个存储桶代表将用于存储数据的二叉树的数量.hashexp为0将使用一棵树,而hashexp为8将使用256棵树.当对散列对象执行查找时,内部算法会(基于散列值)确定键应存在于哪棵树中.然后,它检查该树的值.通过自动知道要查找的256棵树中的哪棵(例如),与单个二叉树相比,它自己可以节省8个比较(2 ^ 8).
As I understand it, hashtables store their data in binary trees. Each bucket created by hashexp represents the number of binary trees that will be used to store the data. A hashexp of 0 would use a single tree, while a hashexp of 8 would use 256 trees. When a lookup is performed against the hash object, an internal algorithm determines which tree the key should exist in (based on the hashed value). It then checks that tree for the value. By automatically knowing which of the 256 trees to look in (for example) it would have saved itself 8 comparisons (2^8) when compared to a single binary tree.
整个事情似乎要复杂得多,但这就是我对为什么它能更快地工作的解释.
The whole thing seems a lot more complex than that but that's my interpretation of why it works out faster.
这篇关于hashexp在SAS HashTable中指定的表大小到底是多少?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!