Problem description
I have the following function I'm trying to write an AVX version for:
void
hashids_shuffle(char *str, size_t str_length, char *salt, size_t salt_length)
{
    size_t i, j, v, p;
    char temp;

    if (!salt_length) {
        return;
    }

    for (i = str_length - 1, v = 0, p = 0; i > 0; --i, ++v) {
        v %= salt_length;
        p += salt[v];
        j = (salt[v] + v + p) % i;

        temp = str[i];
        str[i] = str[j];
        str[j] = temp;
    }
}
I'm trying to vectorize v %= salt_length;.
I want to initialize a vector that contains numbers from 0 to str_length, in order to use SVML's _mm_rem_epu64 to calculate v for each loop iteration.
How do I initialize the vector correctly?
Recommended answer
Asking just how to initialize a vector is basically asking for a tutorial. Google up some of Intel's guides to using intrinsics. I'm going to pretend the question wasn't trivial, and answer about trying to implement this function efficiently. It's definitely not the kind of function you'd try to vectorize as a total beginner.
See the x86 tag wiki for links to docs, esp. Intel's intrinsics guide. See the sse tag wiki for a link to a very nice intro to SIMD programming with SSE intrinsics, and how to use SIMD effectively, among other links.
Summary of contents:
- Unrolling/vectorizing takes care of v % salt_length for free.
- How you could vectorize v++; v %= loop_invariant; if it wasn't a power of 2 or a compile-time constant. This includes an answer to the title question about using _mm_set_epi8 or other ways of initializing a vector for this purpose.
- How to start vectorizing a complex loop like this: start with unrolling to find serial dependencies.
- An untested version of the complete function that vectorizes everything except the % i and the swap (i.e. it vectorizes all the operations that were cheap anyway, like you asked).
- (v + salt[v] + p) (and everything leading up to it) vectorizes to two vpaddw instructions. The prefix-sum setup outside the loop for vectorizing p was tricky, but I eventually vectorized it, too.
The vast majority of the function's run time will be in the scalar inner loop over a vector of j elements, bottlenecking on div (or whatever SVML can do), and/or cache misses with very large strings.
The entire loop can't easily vectorize because the swaps with pseudo-random indices create an unpredictable serial dependency. Using AVX512 gather + shuffle + scatter, with AVX512CD to find conflict bitmasks, might be possible, but that would have to be a separate question. I'm not sure how hard it would be to do this efficiently, or if you'd often end up repeating a vector shuffle many times, only making progress in one non-conflicting element.
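For reference, the conflict-detection step of that idea might look something like this (untested sketch, AVX-512F + AVX-512CD; the gather/shuffle/scatter swap logic and the retry loop for conflicting elements are the hard part and aren't shown):

#include <immintrin.h>

// Untested sketch: which of 16 candidate j indices are free of conflicts with
// earlier elements in the same vector (requires AVX-512F + AVX-512CD).
static inline __mmask16 nonconflicting_lanes(__m512i j_indices)
{
    __m512i conflicts = _mm512_conflict_epi32(j_indices);   // per-element bitmask of earlier duplicates
    return _mm512_cmpeq_epi32_mask(conflicts, _mm512_setzero_si512());  // 1 = safe to process this lane now
}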
Since salt_length = sizeof(size_t) is a compile-time constant and a power of 2 smaller than your vector length, v++ and v %= salt_length don't require any code inside the loop at all; they happen for free as a side-effect of effectively unrolling the loop to do multiple v values in parallel.
(Using a platform-dependent salt size means that a 32-bit build won't be able to process data created with a 64-bit salt. Even the x32 ABI has 32-bit size_t, so changing to uint64_t would seem to make sense, unless you never need to share salted hashes between machines.)
In the scalar loop, v follows a repeating pattern of 0 1 2 3 0 1 2 3 ... (or 0..7, for 64-bit). In vector code, we're doing maybe 8 v values at once with 4B elements in a 32-byte vector, or 16 iterations at once with 2B elements.
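For example, with salt_length == 8 and 16-bit elements (what the code further down ends up using), the per-iteration v values are just a repeating pattern that can be built once, outside the loop:

#include <immintrin.h>

// The 16 v values handled per 256-bit vector, as 16-bit elements, for salt_length == 8.
// 16 is a multiple of 8, so "v++; v %= salt_length" across a whole vector of
// iterations is a no-op: this vector never needs updating inside the loop.
__m256i vvec = _mm256_setr_epi16(0, 1, 2, 3, 4, 5, 6, 7,
                                 0, 1, 2, 3, 4, 5, 6, 7);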
So v becomes a loop-invariant constant vector. Interestingly, so does salt[v], so we never need to do any salt table lookups inside the loop. In fact, v + salt[v] can be pre-computed for both the scalar and the vector version.
The scalar version should pre-compute v + salt[v] and unroll by 4 or 8, too, removing the LUT lookup so all the memory/cache throughput is available for the actual swaps. The compiler probably won't do this for you, so you probably need to unroll manually and write extra code to handle the last odd number of string bytes. (Without unrolling, you could still pre-compute a lookup table of v + salt[v], with a type wide enough not to wrap around.)
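A rough, untested sketch of that unrolled scalar version (assuming salt_length == 8 and ASCII, i.e. non-negative, salt bytes; tail handling left out):

#include <stddef.h>

// Untested sketch: scalar loop unrolled by 8 for salt_length == 8, with
// v + salt[v] precomputed so there are no LUT lookups in the hot loop.
void hashids_shuffle_unrolled8(char *str, size_t str_length, const char *salt)
{
    if (str_length < 2)
        return;

    size_t v_plus_salt[8];                 // v + salt[v] for v = 0..7
    for (size_t v = 0; v < 8; v++)
        v_plus_salt[v] = v + salt[v];

    size_t p = 0;
    size_t i = str_length - 1;
    while (i >= 8) {                       // 8 swaps per pass; v wraps back to 0 each pass
        for (size_t v = 0; v < 8; v++, i--) {
            p += salt[v];
            size_t j = (v_plus_salt[v] + p) % i;
            char temp = str[i];
            str[i] = str[j];
            str[j] = temp;
        }
    }
    // ... finish the remaining i < 8 iterations with the original (rolled) loop body.
}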
Even just making sure salt_length is known at compile time would allow much better code. v %= compile_time_constant is cheaper than a div insn, and extremely cheap when it's a power of 2 (it just turns into v &= 7). The compiler might possibly do this for you if the scalar version can inline, or if you used salt_length = sizeof(size_t) instead of passing it as a function arg at all.
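In scalar terms, the strength reduction the compiler can do once salt_length is a known power of 2 is just:

// salt_length known at compile time and a power of 2 (e.g. 8):
v++;
v &= salt_length - 1;    // exactly v %= salt_length, but a single cheap AND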
If you didn't already know salt_length: i.e. what @harold was suggesting before you revealed the critical information about salt_length:
Since we know v < salt_length to start with, we only ever need at most one v -= salt_length to wrap it back into the right range and maintain that invariant. This is called a "strength reduction" operation, because subtraction is a weaker (and cheaper) operation than division.
// The scalar loop would benefit from this transformation, too.
// Or better: unroll the scalar loop by 8 so everything becomes constant.
v++;
if (v >= salt_length)
    v -= salt_length;
To vectorize just this: let's pretend that all we know is salt_length <= 16, so we can use a vector of 32 uint8_t values. (And we can use pshufb to vectorize the salt[v] LUT lookup.)
// untested  // Vectorizing:  v++; v %= unknown_loop_invariant_value;
if (!salt_length) return;
assert(salt_length <= 16);  // so we can use pshufb for the salt[v] step

__m256i vvec = _mm256_setr_epi8(   // setr: lowest element first, unlike set
    0 % salt_length,  1 % salt_length,  2 % salt_length,  3 % salt_length,
    4 % salt_length,  5 % salt_length,  6 % salt_length,  7 % salt_length,
    8 % salt_length,  9 % salt_length, 10 % salt_length, 11 % salt_length,
   12 % salt_length, 13 % salt_length, 14 % salt_length, 15 % salt_length,
   16 % salt_length, 17 % salt_length, 18 % salt_length, 19 % salt_length,
   20 % salt_length, 21 % salt_length, 22 % salt_length, 23 % salt_length,
   24 % salt_length, 25 % salt_length, 26 % salt_length, 27 % salt_length,
   28 % salt_length, 29 % salt_length, 30 % salt_length, 31 % salt_length);
__m256i v_increment = _mm256_set1_epi8(32 % salt_length);
__m256i vmodulus    = _mm256_set1_epi8(salt_length);

// salt_lut = _mm256_set1_epi64x(salt_byval);  // for a known salt length (pass it by value in a size_t arg, instead of by char*).
// duplicate the salt into both lanes of a vector.  Garbage beyond salt_length isn't looked at.
__m256i salt_lut = _mm256_broadcastsi128_si256(_mm_loadu_si128((const __m128i*)salt));  // never mind that this could segfault if salt is short and at the end of a page.

//__m256i v_plus_salt_lut = _mm256_add_epi8(vvec, salt_lut);  // not safe with 8-bit elements: could wrap
// We could use 16-bit elements and AVX512 vpermw (or vpermi2w to support longer salts)

for (...) {
    vvec = _mm256_add_epi8(vvec, v_increment);                 // ++v;

    // if (!(salt_length > v)) { v -= salt_length; }
    __m256i notlessequal = _mm256_cmpgt_epi8(vmodulus, vvec);  // all-ones where salt_length > v,
    // all-zero where salt_length <= v, where we need to subtract salt_length
    __m256i conditional_sub = _mm256_andnot_si256(notlessequal, vmodulus);  // salt_length or 0
    vvec = _mm256_sub_epi8(vvec, conditional_sub);             // subtract 0 or salt_length

    // salt[v] lookup:
    __m256i saltv = _mm256_shuffle_epi8(salt_lut, vvec);       // salt[v]

    // then maybe pmovzx and vextracti128+pmovzx to zero-extend to 16-bit elements?  Maybe vvec should only be a 16-bit vector?
    // or unpack lo/hi with zeros (but that behaves differently from pmovzx at the lane boundary)
    // or have vvec already holding 16-bit elements with the upper half of each one always zero, and mask after the pshufb to re-zero,
    // or do something clever with vvec, v_increment and vmodulus so vvec can have 0xff in the odd bytes, so pshufb zeros those elements.
}
Of course, if we knew salt_length was a power of 2, we should have just masked off all but the relevant low bits in each element:
vvec = _mm256_add_epi8(vvec, _mm256_set1_epi8(salt_length)); // ++v;
vvec = _mm256_and_si256(vvec, _mm256_set1_epi8(salt_length - 1)); // v &= salt_length - 1; // aka v%=salt_length;
Noticing that we started with the wrong element size is when we realize that only vectorizing one line at a time was a bad idea, because now we have to go change all the code we already wrote to use wider elements, sometimes requiring a different strategy to do the same thing.
Of course, you do need to start with a rough outline, mental or written down, for how you might do each step separately. It's in the process of thinking this through that you see how the different parts might fit together.
For a complex loop, a useful first step might be to try manually unrolling the scalar loop. That will help find the serial dependencies, and the operations that simplify with unrolling.
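For example, writing out two consecutive iterations of the original loop makes the dependency structure visible (a sketch, wrapped in a function just so it compiles):

#include <stddef.h>

// Two consecutive iterations of the original loop, written out (assumes i >= 2).
// Only p and the str[] swaps carry a value from one iteration into the next;
// v (and therefore salt[v] and v + salt[v]) is a pure function of the iteration count.
static void two_iterations(char *str, char *salt, size_t salt_length,
                           size_t *ip, size_t *vp, size_t *pp)
{
    size_t i = *ip, v = *vp, p = *pp;

    p += salt[v];                                   // serial chain through p
    size_t j = (salt[v] + v + p) % i;               // % i: the divisor changes every iteration
    char t = str[i]; str[i] = str[j]; str[j] = t;   // serial chain through str[] contents
    --i; v = (v + 1) % salt_length;                 // v: predictable counter, no real dependency

    p += salt[v];
    j = (salt[v] + v + p) % i;
    t = str[i]; str[i] = str[j]; str[j] = t;
    --i; v = (v + 1) % salt_length;

    *ip = i; *vp = v; *pp = p;
}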
We need elements wide enough to hold the maximum value of i, because i is not a power of 2 and not constant, so that modulo operation takes work. Any wider is a waste and cuts our throughput. If we could vectorize the whole rest of the loop, it might even be worth specializing the function with different versions for different ranges of str_length. (Or maybe looping with 64b elements until i <= UINT32_MAX, then looping until i <= UINT16_MAX, and so on.) If you know you don't need to handle strings larger than 4GiB, you can speed up the common case by only using 32-bit math. (64-bit division is slower than 32-bit division, even when the upper bits are all zero.)
Actually, I think we need elements as wide as the maximum p, since it keeps accumulating forever (until it wraps at 2^64 in the original scalar code). Unlike with a constant modulus, we can't just use p %= i to keep it in check, because modulo doesn't distribute that way: (123 % 33) % (33-16) != 123 % (33-16). Even aligning to 16 doesn't help: 12345 % 32 != 12345 % 48 % 32.
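A quick numeric check makes those counterexamples concrete:

#include <assert.h>

int main(void)
{
    // Reducing first by a different divisor changes the answer:
    assert((123 % 33) % (33 - 16) == 7);   // 24 % 17
    assert( 123       % (33 - 16) == 4);   // 123 % 17: different
    // A divisor step that's a multiple of 16 doesn't help either:
    assert(12345 % 32 == 25);
    assert(12345 % 48 % 32 == 9);          // different
    return 0;
}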
This will quickly make p too large for repeated conditional subtraction of i (until the condition mask is all false), even for fairly large i values.
There are tricks for modulo by known integer constants (see http://libdivide.com/), but AFAIK working out the multiplicative modular inverse for a nearby divisor (even with a power-of-two stride like 16) isn't any easier than for a totally separate number. So we couldn't cheaply adjust just the constants for the next vector of i values.
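For comparison, the invariant-divisor trick with libdivide's C API looks roughly like this (a sketch; the libdivide_u32_gen() precompute is the expensive per-divisor part, and it only pays off when the same divisor is reused many times, which is exactly what we don't have here since i changes every iteration):

#include <stddef.h>
#include <stdint.h>
#include "libdivide.h"   // http://libdivide.com/

// Remainder by a divisor that stays fixed across many calls: precompute once, reuse.
uint32_t sum_of_remainders(const uint32_t *vals, size_t n, uint32_t d)
{
    struct libdivide_u32_t fast_d = libdivide_u32_gen(d);   // the expensive, per-divisor part
    uint32_t total = 0;
    for (size_t k = 0; k < n; k++) {
        uint32_t q = libdivide_u32_do(vals[k], &fast_d);    // vals[k] / d via multiply + shift
        total += vals[k] - q * d;                           // vals[k] % d
    }
    return total;
}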
The law of small numbers perhaps makes it worthwhile to peel off the last couple of vector iterations, with pre-computed vectors of multiplicative modular inverses, so the % i can be done with vectors. Once we're close to the end of the string, it's probably hot in L1 cache, so we're totally bottlenecked on div, not on the swap loads/stores. For this, we'd maybe use a prologue to reach an i value that is a multiple of 16, so the last few vectors as we approach i = 0 always have the same alignment of i values. Or else we'd have a LUT of constants for a range of i values and simply do unaligned loads from it. That means we don't have to rotate salt_v and p.
Possibly converting to FP would be useful, because recent Intel CPUs (especially Skylake) have very powerful FP division hardware, with significant pipelining (throughput : latency ratio). If we can get exact results with the right choice of rounding, this would be great. (float and double can exactly represent any integer up to about the size of their mantissa.)
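A sketch of that idea for the 16-bit j and i values involved here (untested; the multiply-subtract recovers the remainder from a truncated float quotient, and for inputs this small the quotient shouldn't round up across an integer boundary, but that's worth verifying before relying on it):

#include <immintrin.h>

// Untested sketch: j % i for eight 32-bit values known to fit in 16 bits,
// using float division (the int->float conversion is exact far below 2^24).
static inline __m256i rem_small_u32_via_float(__m256i j, __m256i i_vals)
{
    __m256 jf  = _mm256_cvtepi32_ps(j);
    __m256 ifl = _mm256_cvtepi32_ps(i_vals);
    __m256i q  = _mm256_cvttps_epi32(_mm256_div_ps(jf, ifl));  // truncated quotient
    __m256i qi = _mm256_mullo_epi32(q, i_vals);                // q * i, exact in 32 bits
    return _mm256_sub_epi32(j, qi);                            // j - q*i == j % i
}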
I guess it's worth trying Intel's _mm_rem_epu16 (with a vector of i values that you decrement with a vector of set1(16)). If they use FP to get accurate results, that's great. If it just unpacks to scalar and does integer division, it would waste time getting the values back into a vector.
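Usage would be trivial if you have SVML available (it's a library call provided with Intel's compilers, not a single instruction), something like:

#include <immintrin.h>

// Sketch, requires Intel's SVML: remainder of eight unsigned 16-bit j values
// by the current vector of i values.
__m128i rem_j_by_i(__m128i j, __m128i i_vals)
{
    return _mm_rem_epu16(j, i_vals);   // SVML library call, not a hardware instruction
}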
Anyway, certainly the easiest solution is to iterate over the vector elements with a scalar loop. Until you come up with something extremely fancy using AVX512CD for the swaps, it seems reasonable, but it's probably about an order of magnitude slower than just the swaps would be, if they're all hot in L1 cache.
Here's the code on the Godbolt compiler explorer, with full design-notes comments, including the diagrams I made while figuring out the SIMD prefix-sum algo. I eventually remembered I'd seen a narrower version of this as a building block in @ZBoson's floating point SSE Prefix sum answer, but not until after mostly re-inventing it myself.
// See the godbolt link for full design-notes comments,
// including comments about what makes nice ASM or not.
#include <stdint.h>
#include <stdlib.h>
#include <immintrin.h>
#include <assert.h>

static inline
__m256i init_p_and_increment(size_t salt_length, __m256i *p_increment, __m256i saltv_u16, __m128i saltv_u8)
{   // return initial p vector (for the first 16 i values).
    // return increment vector by reference.
    if (salt_length == 4) {
        assert(0); // unimplemented
        // should be about the same as length == 8.  Can maybe factor out some common parts, like up to psum2
    } else {
        assert(salt_length == 8);

        // SIMD prefix sum for n elements in a vector in O(log2(n)) steps.
        __m128i sv = _mm256_castsi256_si128(saltv_u16);
        __m128i pshift1 = _mm_bslli_si128(sv, 2);         // 1 elem (uint16_t)
        __m128i psum1   = _mm_add_epi16(pshift1, sv);
        __m128i pshift2 = _mm_bslli_si128(psum1, 4);      // 2 elem
        __m128i psum2   = _mm_add_epi16(pshift2, psum1);
        __m128i pshift3 = _mm_bslli_si128(psum2, 8);      // 4 elem
        __m128i psum3   = _mm_add_epi16(pshift3, psum2);  // p_initial low 128.  2^3 = 8 elements = salt_length
        // psum3 = the repeating pattern of p values.  Later values just add sum(salt[0..7]) to every element
        __m128i p_init_low = psum3;

        __m128i sum8_low = _mm_sad_epu8(saltv_u8, _mm_setzero_si128()); // sum(s0..s7) in each 64-bit half
        // alternative:
        // sum8_low = _mm_bsrli_si128(p_init_low, 14); // has to wait for psum3 to be ready: lower ILP than doing psadbw separately
        __m256i sum8 = _mm256_broadcastw_epi16(sum8_low);
        *p_increment = _mm256_slli_epi16(sum8, 1);        // set1_epi16(2*sum(salt[0..7]))

        __m128i p_init_high = _mm_add_epi16(p_init_low, _mm256_castsi256_si128(sum8));
        __m256i p_init = _mm256_castsi128_si256(p_init_low);
        p_init = _mm256_inserti128_si256(p_init, p_init_high, 1);
        // not supported by gcc: _mm256_set_m128i(p_init_high, psum3);
        return p_init;
    }
}

void
hashids_shuffle_simd(char *restrict str, size_t str_length, size_t salt_byval)
{
    //assert(salt_length <= 16); // so we can use pshufb for the salt[v] step for non-constant salt length.

    // platform-dependent salt size seems weird. Why not uint64_t?
    size_t salt_length = sizeof(size_t);
    assert(str_length - 1 < UINT16_MAX);  // we do p + v + salt[v] in 16-bit elements
    // TODO: assert((str_length-1)/salt_length * p_increment < UINT16_MAX);

    __m128i saltv_u8;
    __m256i v, saltv;
    if (salt_length == 4) {
        v = _mm256_set1_epi64x(0x0003000200010000);  // `v % salt_length` is 0 1 2 3 repeating
        saltv_u8 = _mm_set1_epi32(salt_byval);
        saltv = _mm256_cvtepu8_epi16(saltv_u8);       // salt[v] repeats with the same pattern: expand it to 16b elements with pmovzx
    } else {
        assert(salt_length == 8);
        v = _mm256_cvtepu8_epi16(_mm_set1_epi64x(0x0706050403020100));
        saltv_u8 = _mm_set1_epi64x(salt_byval);
        saltv = _mm256_cvtepu8_epi16(saltv_u8);
    }

    __m256i v_saltv = _mm256_add_epi16(v, saltv);

    __m256i p_increment;
    __m256i p = init_p_and_increment(salt_length, &p_increment, saltv, saltv_u8);

    for (unsigned i = str_length - 1; i > 0; /* i -= 16 */) {
        // 16 uint16_t j values per iteration.  i-- happens inside the scalar shuffle loop.
        p = _mm256_add_epi16(p, p_increment);   // p += salt[v]; with serial dependencies accounted for, prefix-sum style

        __m256i j_unmodded = _mm256_add_epi16(v_saltv, p);

        // size_t j = (v + salt[v] + p) % i;
        //////// scalar loop over 16 j elements, doing the modulo and swap
        // alignas(32) uint16_t jbuf[16];               // portable C++11 syntax
        uint16_t jbuf[16] __attribute__((aligned(32))); // GNU C syntax
        _mm256_store_si256((__m256i*)jbuf, j_unmodded);

        const int jcount = sizeof(jbuf) / sizeof(jbuf[0]);
        for (int elem = 0; elem < jcount; elem++) {
            if (--i == 0) break;  // this effectively returns from the whole function: the outer loop condition fails

            // 32-bit division is significantly faster than 64-bit division
            unsigned j = jbuf[elem] % (uint32_t)i;
            // doubtful that vectorizing this with Intel SVML _mm_rem_epu16 would be a win,
            // since there's no hardware support for it.  Until AVX512CD, we need each element
            // in a gp reg as an array index anyway.

            char temp = str[i];
            str[i] = str[j];
            str[j] = temp;
        }
    }
}
This compiles to asm that looks about right, but I haven't run it.
Clang makes a fairly sensible inner loop. This is with -fno-unroll-loops for readability; leave that out for performance, although it won't matter here since loop overhead isn't the bottleneck.
# The loop part of clang3.8.1's output. -march=haswell -fno-unroll-loops (only for human readability. Normally it unrolls by 2).
.LBB0_6: # outer loop # in Loop: Header=BB0_3 Depth=1
add esi, 1
.LBB0_3: # first iteration entry point # =>This Loop Header: Depth=1
vpaddw ymm2, ymm2, ymm1 # p += p_increment
vpaddw ymm3, ymm0, ymm2 # v+salt[v] + p
vmovdqa ymmword ptr [rsp], ymm3 # store jbuf
add esi, -1
lea r8, [rdi + rsi]
mov ecx, 1
.LBB0_4: # inner loop # Parent Loop BB0_3 Depth=1
# gcc's version fully unrolls the inner loop, leading to code bloat
test esi, esi # if(i==0) return
je .LBB0_8
movzx eax, word ptr [rsp + 2*rcx - 2] # load jbuf
xor edx, edx
div esi
mov r9b, byte ptr [r8] # swap
mov al, byte ptr [rdi + rdx]
mov byte ptr [r8], al
mov byte ptr [rdi + rdx], r9b
add esi, -1
add r8, -1
cmp rcx, 16 # silly clang, not macro-fusing cmp/jl because it wants to use a weird way to increment.
lea rcx, [rcx + 1]
jl .LBB0_4 # inner loop
jmp .LBB0_6 # outer loop
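If you do want to run it, a minimal harness along these lines would compare the SIMD version against the scalar reference (assuming a 64-bit build, so sizeof(size_t) == 8, and an arbitrary 8-byte salt):

#include <stdio.h>
#include <string.h>

void hashids_shuffle(char *str, size_t str_length, char *salt, size_t salt_length);
void hashids_shuffle_simd(char *str, size_t str_length, size_t salt_byval);

int main(void)
{
    char a[100], b[100];
    size_t salt;                                  // the salt, passed by value: sizeof(size_t) bytes
    memcpy(&salt, "abcdefgh", sizeof(salt));      // needs a 64-bit build (salt_length == 8)

    for (size_t k = 0; k < sizeof(a); k++)
        a[k] = b[k] = (char)('A' + k % 26);

    hashids_shuffle(a, sizeof(a), (char*)&salt, sizeof(salt));   // scalar reference
    hashids_shuffle_simd(b, sizeof(b), salt);                    // untested SIMD version

    printf(memcmp(a, b, sizeof(a)) == 0 ? "match\n" : "MISMATCH\n");
    return 0;
}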