问题描述
番石榴的ImmutableSet
在我关于contains
的基准测试中似乎表现不佳.对于某些尺寸,它甚至比List
慢得多:
Guava's ImmutableSet
seems to perform quite poorly in my benchmark concerning contains
. For some sizes it gets even much slower than List
:
size benchmark ns linear runtime
100000 ListContains 110279.54 ==
100000 SetContains 7.15 =
100000 ImmutableSetContains 76716.47 =
200000 ListContains 275367.66 =====
200000 SetContains 7.34 =
200000 ImmutableSetContains 322185.50 ======
500000 ListContains 935210.10 ====================
500000 SetContains 7.79 =
500000 ImmutableSetContains 1382765.76 ==============================
基本上,我用数千个负整数填充一个集合,并用非负整数测试包含的整数.该代码是微不足道的,但是对于粘贴在较小的文本区域中来说太长了,因此请查看此处.
Basically, I fill a set with a few thousands negative integers and test contains with non-negative ones. The code is trivial but a bit too long for pasting in a small text area, so please look here.
我想知道这是怎么回事.可能是我遇到了一些简陋的案件,尽管我显然没有尝试过.也许我只是吹破了基准.否则,我想知道是否可以而且应该解决该问题.
I wonder what's going on here. Probably, I hit some degenerate case, although I obviously didn't try to. Or maybe I've just blew the benchmark. Otherwise, I wonder if it can and should be fixed.
解决方案是通过替换
hashCode ^= (hashCode >>> 20) ^ (hashCode >>> 12);
return hashCode ^ (hashCode >>> 7) ^ (hashCode >>> 4);
作者
return C2 * Integer.rotateLeft(hashCode * C1, 15);
这大约需要相同的时间,并且可能有一些缺点,但是可以很好地散布散列来解决当前的问题.
This takes about the same time and might have some disadvantage, but solves the current problem by nicely spreading the hashes.
推荐答案
说明实际上非常简单.想象一下,整数位于某些N
的集合M = {0, 1, ..., N-1}
中,这是2的幂.由于Integer.hashCode
的定义方式,哈希也是如此.通过与smear处理哈希.java#HashMap.hash%28int%29"rel =" nofollow>此,以便在某些常见情况下将最低位的冲突降到最低.
The explanation is actually very simple. Imagine the integers lie in the set M = {0, 1, ..., N-1}
for some N
, which is a power of two. And so do the hashes, because of how Integer.hashCode
is defined. The hashes get processed via a function smear
identical to this one in order to minimize collision in the lowest bits in some common cases.
分配大小为2*N
的表,并将M
的成员放入表中.由于smear
仅涉及向右移位,因此它将M
映射到自身,这意味着将填充表的连续范围.因此,可以说表的左半部分中的所有插槽均已使用,而另一半未使用.
A table of size 2*N
gets allocated and the members of M
get put into it. As smear
involves only right shifts, it maps M
onto itself, which means that a continuous range of the table gets filled. So lets say that all slots in the left half of the table are used and the other half is unused.
当调用contains(o)
时,搜索从由o.hashCode()
确定位置的插槽开始.如果找到o
,则结果为true
,如果命中了一个空插槽,则结果为false
.否则,搜索进行到另一个时隙.为了最大程度地减少缓存未命中,使用了线性探测.
When contains(o)
gets invoked, the search starts in a slot which position is determined by o.hashCode()
. If o
gets found, the result is true
, if an empty slot gets hit, the result is false
. Otherwise, the search proceeds to another slot. In order to minimize cache misses, linear probing gets used.
当我们很不幸无法在第一个使用的插槽开始搜索时,必须遍历所有这些位置,这意味着N
步骤.从随机位置开始意味着平均N/4
步进.
When we are unlucky enough to start the search at the first used slot, all of them have to be traversed, which means N
steps. Starting at a random position means N/4
steps on the average.
在我的基准测试中发生的情况与上述情况不完全相同,但是其性能不佳的原因是相同的.将大小乘以2的幂会使问题变得更糟.
What happened in my benchmark is not exactly as above, but the reason for its bad performance is the same. Making the sizes to powers of two makes the problem worse.
这篇关于番石榴的ImmutableSet.contains的性能的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!