This article looks at how to atomically clear the lowest set (non-zero) bit of an unsigned integer; hopefully it is a useful reference for anyone tackling the same problem.

Problem description

Question:
I'm looking for the best way to clear the lowest non-zero bit of an unsigned atomic like std::atomic_uint64_t in a thread-safe fashion without using an extra mutex or the like. In addition, I also need to know which bit got cleared.

Example: Let's say the current value stored is 0b0110. I want to know that the lowest non-zero bit is bit 1 (0-indexed), and set the variable to 0b0100.

The best version I came up with is this:

#include <atomic>
#include <cstdint>

inline uint64_t with_lowest_non_zero_cleared(std::uint64_t v){
    return (v-1) & v;
}
inline uint64_t only_keep_lowest_non_zero(std::uint64_t v){
    return v & ~with_lowest_non_zero_cleared(v);
}

uint64_t pop_lowest_non_zero(std::atomic_uint64_t& foo)
{
    auto expected = foo.load(std::memory_order_relaxed);
    while(!foo.compare_exchange_weak(expected,with_lowest_non_zero_cleared(expected), std::memory_order_seq_cst, std::memory_order_relaxed))
    {;}
    return only_keep_lowest_non_zero(expected);
}

Is there a better solution?

Notes:

  • It doesn't have to be the lowest non-zero bit – I'd also be happy with a solution that clears the highest bit or sets the highest/lowest zero bit, if that makes a difference.
  • I'd strongly prefer a standard C++ (17) version, but would accept an answer that uses intrinsics for clang and MSVC.
  • It should be portable (compiling with clang or MSVC for x64 and AArch64), but I'm mostly interested in performance on recent Intel and AMD x64 architectures when compiled with clang.
  • The update has to be atomic and global progress must be guaranteed, but just as with the solution above, it doesn't have to be a wait-free algorithm (of course I'd be very happy if you can show me one).

Answer

There is no direct hardware support for atomic clear-lowest-bit on x86. BMI1 blsr is only available in memory-source form, not memory-destination form; lock blsr [shared_var] does not exist. (Compilers know how to optimize (var-1) & (var) into blsr for local vars when you compile with -march=haswell or otherwise enable code-gen that assumes BMI1 support.) So even if you can assume BMI1 support, it doesn't let you do anything fundamentally different.

The best you can do on x86 is the lock cmpxchg retry loop you propose in the question. This is a better option than finding the right bit in an old version of the variable and then using lock btr, especially if it would be a correctness problem to clear the wrong bit if a lower bit was set between the bit-scan and the lock btr. And you'd still need a retry loop in case another thread already cleared the bit you wanted.

(Unless your program has very high contention for the shared variable, which would be problematic for performance even with lock add where there's no trying, just hardware arbitration for access to cache lines. If that's the case, you should probably redesign your lockless algorithm and/or consider some kind of OS-assisted sleep/wake instead of having a lot of cores spending a lot of CPU time hammering on the same cache line. Lockless is great in the low-contention case.)

The window for the CPU to lose the cache line between the load to get expected and running lock cmpxchg with the result of a couple instructions on that value is tiny. Usually it will succeed the first time through, because the cache line will still be present in L1d cache when the cmpxchg runs. When the cache line arrives, it will (hopefully) already be in MESI Exclusive state, if the CPU saw far enough ahead to do an RFO for it.

You can instrument your cmpxchg retry loops to see how much contention you actually get in your real program. (e.g. by incrementing a local inside the loop and using an atomic relaxed += into a shared counter once you succeed, or with thread-local counters.)
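
For example, a minimal sketch of such instrumentation (the counter name and the reporting scheme are made up here for illustration, not part of the original code):

#include <atomic>
#include <cstdint>

// Hypothetical global counter, only for measuring contention.
std::atomic<std::uint64_t> cas_retries{0};

uint64_t pop_lowest_non_zero_instrumented(std::atomic_uint64_t& foo)
{
    unsigned local_retries = 0;
    auto expected = foo.load(std::memory_order_relaxed);
    while(!foo.compare_exchange_weak(expected, (expected - 1) & expected,
                                     std::memory_order_seq_cst, std::memory_order_relaxed))
    {
        ++local_retries;                      // count failed CAS attempts locally
    }
    if (local_retries)                        // at most one extra shared RMW per call
        cas_retries.fetch_add(local_retries, std::memory_order_relaxed);
    return (-expected) & expected;            // isolate the bit that was just cleared
}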

Remember that your real code (hopefully) does plenty of work between atomic operations on this bitmask, so it's very different from a microbenchmark where all the threads spend all their time hammering on that cache line.

"The update must be atomic and global progress must be guaranteed, but just as with the solution above, it doesn't have to be a wait-free algorithm (of course I'd be very happy if you can show me one)."

A CAS retry loop (even one compiled on an LL/SC machine, see below) is lock-free in the technical sense: you only have to retry if some other thread made progress.

CAS leaves the cache line unmodified if it fails. On x86 it dirties the cache line (MESI M state), but x86 cmpxchg doesn't detect ABA, it only compares, so one other thread that loaded the same expected will succeed. On LL/SC machines, hopefully an SC failure on one core doesn't cause spurious SC failures on other cores, otherwise it could livelock. That's something you can assume CPU architects thought of.

It's not wait-free because a single thread can in theory have to retry an unbounded number of times.

Your code compiles with gcc8.1 -O3 -march=haswell to this asm (from the Godbolt compiler explorer)

# gcc's code-gen for x86 with BMI1 looks optimal to me.  No wasted instructions!
# presumably you'll get something similar when inlining.
pop_lowest_non_zero(std::atomic<unsigned long>&):
    mov     rax, QWORD PTR [rdi]
.L2:
    blsr    rdx, rax                      # Bit Lowest Set Reset
    lock cmpxchg    QWORD PTR [rdi], rdx
    jne     .L2                           # fall through on success: cmpxchg sets ZF on equal like regular cmp
    blsi    rax, rax                      # Bit Lowest Set Isolate
    ret

Without BMI1, blsr and blsi become two instructions each. This is close to irrelevant for performance given the cost of a locked instruction.

clang and MSVC unfortunately are slightly more clunky, with a taken branch on the no-contention fast path. (And clang bloats the function by peeling the first iteration. IDK if this helps with branch prediction or something, or if it's purely a missed optimization. We can get clang to generate the fast path with no taken branches using an unlikely() macro; see "How do the likely() and unlikely() macros in the Linux kernel work and what is their benefit?".)

Footnote 1:

Unless you're building binaries for a known set of machines, you can't assume BMI1 is available anyway. So the compiler will need to do something like lea rdx, [rax-1] / and rdx, rax instead of blsr rdx, rax. This makes a trivial difference for this function; the majority of the cost is the atomic operation even in the uncontended case, even for the hot-in-L1d cache no contention case, looking at the out-of-order execution throughput impact on surrounding code. (e.g. 10 uops for lock cmpxchg on Skylake (http://agner.org/optimize/) vs. saving 1 uop with blsr instead of 2 other instructions. Assuming the front-end is the bottleneck, rather than memory or something else. Being a full memory barrier has an impact on loads/stores in surrounding code, too, but fortunately not on out-of-order execution of independent ALU instructions.)

Most non-x86 machines do all their atomic operations with load-linked / store-conditional retry loops. It's a bit unfortunate that C++11 doesn't allow you to create custom LL/SC operations (e.g. with (x-1) & x inside an LL/SC instead of just the add that you'd get from using fetch_add), but CAS machines (like x86) can't give you the ABA detection that LL/SC provides. So it's not clear how you'd design a C++ class that could compile efficiently on x86 but also directly to a LL/SC retry loop on ARM and other LL/SC ISAs. (See this discussion.)

So when C++ doesn't provide the operation you want directly (the way it does provide fetch_or or fetch_and), you just have to write it yourself using compare_exchange_weak.
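
As an illustration of that pattern (just a sketch, not a standard-library facility; the helper name fetch_update and its default memory orders are made up here), the CAS-loop boilerplate can be factored into a reusable helper:

#include <atomic>
#include <cstdint>

// Hypothetical helper: apply a pure function to an atomic with a CAS retry loop.
// Returns the old value, so the caller can see e.g. which bit was cleared.
template <class T, class F>
T fetch_update(std::atomic<T>& obj, F f,
               std::memory_order success = std::memory_order_seq_cst,
               std::memory_order failure = std::memory_order_relaxed)
{
    T expected = obj.load(std::memory_order_relaxed);
    while (!obj.compare_exchange_weak(expected, f(expected), success, failure))
    {
        // compare_exchange_weak reloaded 'expected' for us; just retry.
    }
    return expected;
}

// Usage: same semantics as pop_lowest_non_zero from the question.
uint64_t pop_lowest(std::atomic_uint64_t& foo)
{
    uint64_t old = fetch_update(foo, [](uint64_t v) { return (v - 1) & v; });
    return (-old) & old;    // isolate the lowest set bit of the old value
}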

What you get in practice from current compilers is a compare_exchange_weak implemented with LL/SC, and your arithmetic operation separate from that. The asm actually does the relaxed load before the exclusive-load-acquire (ldaxr), instead of just basing the computation on the ldaxr result. And there's extra branching to check that expected from the last attempt matches the new load result before attempting the store.

The LL/SC window is maybe shorter than with 2 dependent ALU instructions between the load and store, in case that matters. The CPU has the desired value ready ahead of time, not dependent on the load result. (It depends on the previous load result.) Clang's code-gen puts the sub/and inside the loop, but dependent on the previous iteration's load, so with out of order execution they can still be ready ahead of time.

But if that was really the most efficient way to do things, compilers should be using it for fetch_add as well. They don't because it probably isn't. LL/SC retries are rare, just like CAS retries on x86. I don't know if it could make a difference for very-high-contention situations. Maybe, but compilers don't slow down the fast path to optimize for that when compiling fetch_add.

I renamed your functions and re-formatted the while() for readability, because it was already too long for one line without wrapping it with unlikely().

This version compiles to maybe slightly nicer asm than your original, with clang. I also fixed your function names so it actually compiles at all; there's a mismatch in your question. I picked totally different names that are similar to x86's BMI instruction names because they succinctly describe the operation.

#include <atomic>
#include <cstdint>

#ifdef __GNUC__
#define unlikely(expr) __builtin_expect(!!(expr), 0)
#define likely(expr) __builtin_expect(!!(expr), 1)
#else
#define unlikely(expr) (expr)
#define likely(expr) (expr)
#endif

inline uint64_t clear_lowest_set(std::uint64_t v){
    return (v-1) & v;
}
inline uint64_t isolate_lowest_set(std::uint64_t v){
    //return v & ~clear_lowest_set(v);
    return (-v) & v;
    // MSVC optimizes this better for ARM when used separately.
    // The other way (in terms of clear_lowest_set) does still CSE to 2 instructions
    //  when the clear_lowest_set result is already available
}

uint64_t pop_lowest_non_zero(std::atomic_uint64_t& foo)
{
    auto expected = foo.load(std::memory_order_relaxed);
    while(unlikely(!foo.compare_exchange_weak(
        expected, clear_lowest_set(expected),
        std::memory_order_seq_cst, std::memory_order_relaxed)))
        {}

    return isolate_lowest_set(expected);
}

Clang -O3 for AArch64 (-target aarch64-linux-android -stdlib=libc++ -mcpu=cortex-a57 on Godbolt) makes some weird code. This is from clang6.0, but there is weirdness with older versions, too, creating a 0/1 integer in a register and then testing it instead of just jumping to the right place in the first place.

pop_lowest_non_zero(std::__1::atomic<unsigned long long>&): // @pop_lowest_non_zero(std::__1::atomic<unsigned long long>&)

    ldr     x9, [x0]                   @ the relaxed load
    ldaxr   x8, [x0]                   @ the CAS load (a=acquire, x=exclusive: pairs with a later stxr)
    cmp     x8, x9                   @ compare part of the CAS
    b.ne    .LBB0_3
    sub     x10, x9, #1
    and     x10, x10, x9             @ clear_lowest( relaxed load result)
    stlxr   w11, x10, [x0]           @ the CAS store (sequential-release)
    cbnz    w11, .LBB0_4             @ if(SC failed) goto retry loop

          # else fall through and eventually reach the epilogue

    # this looks insane.  w10 = 0|1, then branch if bit[0] != 0.  Always taken, right?
    orr     w10, wzr, #0x1
    tbnz    w10, #0, .LBB0_5         @ test-bit not-zero will always jump to the epilogue
        b       .LBB0_6              # never reached


.LBB0_3:
    clrex                           @ clear the ldrx/stxr transaction
.LBB0_4:
    # This block is pointless; just should b to .LBB0_6
    mov     w10, wzr
    tbz     w10, #0, .LBB0_6    # go to the retry loop, below the ret (not shown here)

.LBB0_5:               # isolate_lowest_set and return
    neg     x8, x9
    and     x0, x9, x8
    ret

.LBB0_6:
     @ the retry loop.  Only reached if the compare or SC failed.
     ...

All this crazy branching might not be a real performance problem, but it's obvious this could be a lot more efficient (even without teaching clang how to use LL/SC directly instead of emulated CAS). Reported as clang/LLVM bug 38173 (https://bugs.llvm.org/show_bug.cgi?id=38173).

It seems MSVC doesn't know that AArch64's release-stores are sequential-release (not just regular release like x86), because it's still using a dmb ish instruction (full memory barrier) after stlxr. It has fewer wasted instructions, but its x86 bias is showing or something: compare_exchange_weak compiles like compare_exchange_strong, with a probably-useless retry loop that will try just the CAS again with the old expected/desired on LL/SC failure. That will usually be because another thread modified the line, so expected will mismatch. (Godbolt doesn't have MSVC for AArch64 in any older versions, so perhaps support is brand new; that might explain the dmb.)

   ## MSVC Pre 2018 -Ox
|pop_lowest_non_zero| PROC
    ldr         x10,[x0]          # x10 = expected = foo.load(relaxed)

|$LL2@pop_lowest|           @ L2  # top of the while() loop
    sub         x8,x10,#1
    and         x11,x8,x10        # clear_lowest(relaxed load result)
|$LN76@pop_lowest|          @ LN76
    ldaxr       x8,[x0]
    cmp         x8,x10            # the compare part of CAS
    bne         |$LN77@pop_lowest|
    stlxr       w9,x11,[x0]
    cbnz        w9,|$LN76@pop_lowest|  # retry just the CAS on LL/SC fail; this looks like compare_exchange_strong
     # fall through on LL/SC success

|$LN77@pop_lowest|          @ LN77
    dmb         ish                # full memory barrier: slow
    cmp         x8,x10             # compare again, because apparently MSVC wants to share the `dmb` instruction between the CAS-fail and CAS-success paths.
    beq         |$LN75@pop_lowest| # if(expected matches) goto epilogue
    mov         x10,x8             # else update expected
    b           |$LL2@pop_lowest|  # and goto the top of the while loop

|$LN75@pop_lowest|          @ LN75   # function epilogue
    neg         x8,x10
    and         x0,x8,x10
    ret

gcc6.3 for AArch64 makes weird code, too, storing expected to the stack. (Godbolt doesn't have newer AArch64 gcc).

I'm very unimpressed with AArch64 compilers for this! IDK why they don't generate something clean and efficient like this, with no taken branches on the fast path, and only a bit of out-of-line code to set up for jumping back into the CAS to retry.

pop_lowest_non_zero ## hand written based on clang
                    # clang could emit this even without turning CAS into LL/SC directly

    ldr     x9, [x0]                   @ x9 = expected = foo.load(relaxed)
 .Lcas_retry:
    ldaxr   x8, [x0]                   @ x8 = the CAS load (a=acquire, x=exclusive: pairs with a later stxr)
    cmp     x8, x9                   @ compare part of the CAS
    b.ne    .Lcas_fail
    sub     x10, x9, #1
    and     x10, x10, x9             @ clear_lowest (relaxed load result)
    stlxr   w11, x10, [x0]           @ the CAS store (sequential-release)
    cbnz    w11, .Lllsc_fail

    # LL/SC success: isolate_lowest_set and return
.Lepilogue:
    neg     x8, x9
    and     x0, x9, x8
    ret

.Lcas_fail:
    clrex                           @ clear the ldrx/stxr transaction
.Lllsc_fail:
    mov     x9, x8                  @ update expected
    b     .Lcas_retry           @ instead of duplicating the loop, jump back to the existing one with x9 = expected

Compare with fetch_add:

Clang does make nice code for fetch_add():

    mov     x8, x0
.LBB1_1:
    ldxr    x0, [x8]                # relaxed exclusive load: I used mo_release
    add     x9, x0, #1
    stlxr   w10, x9, [x8]
    cbnz    w10, .LBB1_1            # retry if LL/SC failed
    ret

Instead of add #1, we'd like to have sub x9, x8, #1 / and x9, x9, x8, so we just get a LL/SC retry loop. This saves code-size, but it won't be significantly faster. Probably not even measurably faster in most cases, especially as part of a whole program where it's not used an insane amount.

Can you use an atomic counter instead of a bitmap, and map it to a bitmap when needed? Operations that want a bitmap can map the counter to a bitmap with uint64_t(~0ULL) << (64-counter) (for non-zero counter only). Or maybe tmp = 1ULL << counter; tmp ^= tmp-1; (i.e. x86 xor eax,eax / bts rax, rdi (rax = 1 bit set at position 0..63) / blsmsk rax, rax (rax = all bits set up to and including that position)). Hmm, that still needs a special case for mask=0, because there are 65 possible states for a contiguous bitmask: highest (or lowest) bit at one of 64 positions, or no bits set at all.
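
A sketch of that counter-to-mask mapping with the edge cases handled explicitly (assuming here that the counter means "number of contiguous set bits starting at bit 0"; shifting a 64-bit value by 64 would be undefined behaviour, hence the special cases):

#include <cstdint>

// Hypothetical mapping: counter in 0..64  ->  bitmask with that many low bits set.
inline std::uint64_t counter_to_mask(unsigned counter)
{
    if (counter == 0)  return 0;          // no bits set
    if (counter >= 64) return ~0ULL;      // all 64 bits set
    return (1ULL << counter) - 1;         // counter low bits set
}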

Or if there's some pattern to the bitmap, x86 BMI2 pdep can scatter contiguous bits into that pattern. Get N contiguous set bits with (1ULL << counter) - 1, again for counter < 64.
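
A rough illustration of the pdep idea (x86-only; _pdep_u64 comes from <immintrin.h> and requires BMI2, e.g. compile with -mbmi2; the pattern argument is made up for illustration):

#include <cstdint>
#include <immintrin.h>   // _pdep_u64 (BMI2)

// Hypothetical: scatter the N lowest contiguous bits into an arbitrary fixed pattern,
// e.g. pattern = 0x5555555555555555 would use every other bit position.
inline std::uint64_t counter_to_patterned_mask(unsigned counter, std::uint64_t pattern)
{
    std::uint64_t contiguous = (counter >= 64) ? ~0ULL : (1ULL << counter) - 1;
    return _pdep_u64(contiguous, pattern);   // deposit contiguous low bits into 'pattern' positions
}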

Or maybe it does have to be a bitmask, but it doesn't need to be one single bitmask?

No idea what your use-case is, but this kind of idea might be useful:

Do you need a single atomic bitmap that all threads have to contend for? Perhaps you could break it into multiple chunks, each in a separate cache line. (But that makes it impossible to atomically snapshot the entire map.) Still, if each thread has a preferred chunk, and always tries that before moving on to look for a slot in another chunk, you might reduce contention in the average case.
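
A rough sketch of that kind of sharding (the struct name, chunk count, and return convention are all made up for illustration; each chunk gets its own cache line to avoid false sharing):

#include <atomic>
#include <cstddef>
#include <cstdint>
#include <utility>

struct ShardedBitmap {
    static constexpr std::size_t kChunks = 8;                    // arbitrary shard count
    struct alignas(64) Chunk { std::atomic_uint64_t bits{0}; };  // one cache line per chunk
    Chunk chunks[kChunks];

    // Pop the lowest set bit of some chunk, trying the caller's preferred chunk first.
    // Returns {chunk index, isolated bit}, or {kChunks, 0} if every chunk was empty.
    std::pair<std::size_t, std::uint64_t> pop_any(std::size_t preferred)
    {
        for (std::size_t i = 0; i < kChunks; ++i) {
            std::size_t c = (preferred + i) % kChunks;
            auto expected = chunks[c].bits.load(std::memory_order_relaxed);
            while (expected != 0) {
                if (chunks[c].bits.compare_exchange_weak(
                        expected, (expected - 1) & expected,
                        std::memory_order_seq_cst, std::memory_order_relaxed))
                    return {c, (-expected) & expected};   // the bit we just cleared
                // on failure, 'expected' holds the current value; retry if still non-zero
            }
        }
        return {kChunks, 0};                              // all chunks empty
    }
};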

In asm (or with horrible union hacks in C++), you could try to reduce cmpxchg failures without reducing contention by finding the right byte of the 64-bit integer to update, and then only lock cmpxchg on it. That's not actually helpful for this case because two threads that see the same integer will both try to clear the same bit. But it could reduce retries between this and something that sets other bits in the qword. Of course this only works if it's not a correctness problem for lock cmpxchg to succeed when other bytes of the qword have changed.
