Question
I've been reading a bit about the Java String class's hashCode() here recently, and I haven't been able to find this information: what happens when a string is longer than 32 characters? (I know an overflow then happens, but what does that mean for its use as a hash key?) For example, I need to hash strings that are between 20 and 120 characters long to use them as hash keys. Do I need to implement my own algorithm using BigInteger?
Also, since I might have between 30k and 80k strings, maybe more, is the usual String hashCode() collision-free enough?
Recommended answer
In Java, arithmetic overflows and underflows of primitive types do not raise runtime errors or exceptions. The overflowed portion of the result is simply lost.
While this can result in logic errors or other difficulties if the programmer is not aware of this property, it is the specified behavior of the JVM.
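As a minimal sketch of this behavior (the class and method names are illustrative, not from the original answer): plain `int` addition wraps around silently, while `Math.addExact` from the standard library is one way to opt in to an exception when that is what you want.

```java
class OverflowDemo {
    // Adds one using ordinary int arithmetic; on overflow the result wraps around
    // and no exception is thrown.
    static int incrementWrapping(int value) {
        return value + 1;
    }

    public static void main(String[] args) {
        int wrapped = incrementWrapping(Integer.MAX_VALUE);
        // The high bits are dropped, so the result is the smallest int.
        System.out.println(wrapped == Integer.MIN_VALUE);

        try {
            // Math.addExact checks for overflow and throws instead of wrapping.
            Math.addExact(Integer.MAX_VALUE, 1);
        } catch (ArithmeticException e) {
            System.out.println("addExact threw: " + e.getMessage());
        }
    }
}
```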
You do not need to worry about overflow or underflow of int values when calculating hash codes. The overflowed bits are simply lost.
This does not affect the correctness of the computed hash value or its ability to distribute to hash buckets well.
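To make this concrete for the BigInteger question, here is a small sketch (class and method names are made up for illustration). It recomputes the formula documented for `String.hashCode()`, `s[0]*31^(n-1) + ... + s[n-1]`, once with wrapping `int` arithmetic and once exactly with `BigInteger`, and compares both against the built-in value for a 120-character string. Keeping only the low 32 bits of the exact result gives the same number, which is why no custom BigInteger algorithm is needed.

```java
import java.math.BigInteger;

class StringHashSketch {
    // Documented String.hashCode() formula, evaluated with wrapping int arithmetic.
    static int polynomialHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i); // overflow here silently drops the high bits
        }
        return h;
    }

    // Same polynomial evaluated exactly, then truncated to its low-order 32 bits.
    static int exactHashLow32Bits(String s) {
        BigInteger h = BigInteger.ZERO;
        BigInteger base = BigInteger.valueOf(31);
        for (int i = 0; i < s.length(); i++) {
            h = h.multiply(base).add(BigInteger.valueOf(s.charAt(i)));
        }
        return h.intValue(); // intValue() keeps only the low-order 32 bits
    }

    // Builds a sample key of the requested length (illustrative test data).
    static String sampleKey(int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append((char) ('a' + i % 26));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String key = sampleKey(120);
        System.out.println(polynomialHash(key) == key.hashCode());
        System.out.println(exactHashLow32Bits(key) == key.hashCode());
    }
}
```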
A couple of things that can be handy to keep in mind:
Java Strings are immutable. For this reason, the hash value of a String instance is calculated only once. After that, the result is cached in the instance so that subsequent invocations of hashCode() do not result in repeated computations. This works because Strings are immutable, so recomputing would give the same value every time.
The hash code really should be computed from all the meaningful information in an instance. This means that if your String contains 20k of information, the hash code should be computed from all 20k of it (but see above). Of course, there are performance implications, so you should design your program accordingly.
Collision 'free'-ness has much, much more to do with the quality of your hashCode() implementation and less to do with the size of your Strings. Algorithms used to generate hash codes should be capable of producing good distributions. There is no single precise definition of a "good hash function"; that is a subject for mathematical theorists. Fortunately it is not hard to define a hash function that is "good enough" even if it may not be "state of the art" (see Effective Java, 2nd ed.; J. Bloch).