问题描述
我目前正在研究是否有可能加快一辆面包车昂德博厄斯(或任何树)树的遍历。给定一个搜索查询作为输入,已经有在高速缓存行(面包车EMDE博阿斯布局)多个树节点,树的遍历似乎是指令瓶颈。
I am currently researching whether it would be possible to speed up a van Emde Boas (or any tree) tree traversal. Given a single search query as input, already having multiple tree nodes in the cache line (van emde Boas layout), tree traversal seems to be instruction-bottlenecked.
作为还挺新的SIMD / AVX / SSE指令,我想从主题专家知道是否有可能到多个节点一次比较的值,然后找出跟踪的代码树通路上。我的研究导致以下问题:
Being kinda new to SIMD/AVX/SSE instructions, I would like to know from experts in that topic whether it would be possible to compare multiple nodes at once to a value and then find out which tree path to follow further on. My research lead to the following question:
多少个CPU周期/指令被浪费上注册等等。这将使得其对鲁尼使用,如果施工需要更多的时间比手动遍历整个子树(2 + 4 SIMD / AVX / SSE建设+在1超高速缓存行大小为64字节的8个节点)。
How many cpu cycles/instructions are wasted on construction of SIMD/AVX/SSE register etc.. This would make its use for the wayne, if construction takes more time than traversing the whole sub-tree manually (2+4+8 nodes in 1 cacheline of size 64 bytes).
多少个CPU周期/指令上找到适当的SIMD / AVX / SSE寄存器浪费拿着答案的路径来遵循?任何人可以想出一个聪明的办法,使那些findMinimumIntegerAVX指令可以用来决定在1(??)CPU周期?
How many cpu cycles/instructions are wasted on finding the proper SIMD/AVX/SSE register holding the answer of which path to follow on ? Could anybody come up with a smart way so that those "findMinimumInteger" AVX instructions could be used to decide that in 1 (??) cpu cycle ?
什么是你的猜测?
另一个更棘手的方法来加快树遍历将有多个搜索querys向下运行一次,当有很高的概率在节点中的最后一棵树级别土地紧密地结合在一起。任何对此的猜测? OFC那就得把那些querys抛开那些不属于同一子树再然后递归地发现他们整理树的第一个平行穿越之后。该树querys有顺序的,虽然不是恒定的访问模式(查询[I]总是小于查询比第[i + 1])。
Another, more tricky approach to speed up tree traversal would be to have multiple search querys run down at once, when there is high probability to land in nodes closely together in the last tree level. Any guesses on this ? Ofc it would have to put those querys aside that do not belong to the same sub-tree any longer and then recursively find them after finishing the first "parallel traversal" of the tree.. The tree querys have sequential, though not constant access patterns (query[i] always < than query[i+1]).
重要提示:这东西是关于整树的,这就是为什么用面包车昂德博阿斯树(可能的X快速/ Y-快速尝试以后)
Important: this stuff is about integer tree's, which is why van Emde Boas Tree is used (maybe x-fast/y-fast tries later on)
我好奇的是什么这个问题上你50美分,因为人们可能有兴趣在大规模树的最高achieveable性能。预先感谢您对这个虽然你的时间花费: - )
I am curious about what is your 50 cents on this issue, given that one might be interested in the highest achieveable performance on large scale tree's. Thank you in advance for your time spending on this though :-)
推荐答案
我用SSE2 / AVX2,以帮助执行B +树搜索。这里是code就在AVX2 16个DWORDs整个缓存线执行二进制搜索:
I've used SSE2/AVX2 to help perform a B+tree search. Here's code to perform a binary search on a full cache line of 16 DWORDs in AVX2:
// perf-critical: ensure this is 64-byte aligned. (a full cache line)
union bnode
{
int32_t i32[16];
__m256i m256[2];
};
// returns from 0 (if value < i32[0]) to 16 (if value >= i32[15])
unsigned bsearch_avx2(bnode const* const node, __m256i const value)
{
__m256i const perm_mask = _mm256_set_epi32(7, 6, 3, 2, 5, 4, 1, 0);
// compare the two halves of the cache line.
__m256i cmp1 = _mm256_load_si256(&node->m256[0]);
__m256i cmp2 = _mm256_load_si256(&node->m256[1]);
cmp1 = _mm256_cmpgt_epi32(cmp1, value); // PCMPGTD
cmp2 = _mm256_cmpgt_epi32(cmp2, value); // PCMPGTD
// merge the comparisons back together.
//
// a permute is required to get the pack results back into order
// because AVX-256 introduced that unfortunate two-lane interleave.
//
// alternately, you could pre-process your data to remove the need
// for the permute.
__m256i cmp = _mm256_packs_epi32(cmp1, cmp2); // PACKSSDW
cmp = _mm256_permutevar8x32_epi32(cmp, perm_mask); // PERMD
// finally create a move mask and count trailing
// zeroes to get an index to the next node.
unsigned mask = _mm256_movemask_epi8(cmp); // PMOVMSKB
return _tzcnt_u32(mask) / 2; // TZCNT
}
您会拥有一个高度predictable每 B节点
枝,以测试树月底已经达到了。
You'll end up with a single highly predictable branch per bnode
, to test if the end of the tree has been reached.
这应该是平凡扩展到AVX-512。
This should be trivially scalable to AVX-512.
要preprocess,摆脱慢 PERMD
指令,这将用于:
To preprocess and get rid of that slow PERMD
instruction, this would be used:
void preprocess_avx2(bnode* const node)
{
__m256i const perm_mask = _mm256_set_epi32(3, 2, 1, 0, 7, 6, 5, 4);
__m256i *const middle = (__m256i*)&node->i32[4];
__m256i x = _mm256_loadu_si256(middle);
x = _mm256_permutevar8x32_epi32(x, perm_mask);
_mm256_storeu_si256(middle, x);
}
这篇关于使用SIMD / AVX / SSE的树遍历的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!