This post looks at implementing locks around memory operations via inline assembly; hopefully the answer is a useful reference for anyone facing the same problem.

Problem description

I'm new to the low-level stuff, so I'm fairly ignorant of what kind of problems come up down there, and I'm not even sure I understand the term "atomic" right. Right now I'm trying to make simple atomic locks around memory manipulation via extended assembly. Why? For the sake of curiosity. I know I'm reinventing the wheel here and possibly oversimplifying the whole process.

The question?
Does the code I present here achieve the goal of making memory operations both threadsafe and reentrant?


  • If it works, why?

  • If it doesn't work, why?

  • Not good enough? Should I, for example, make use of the register keyword in C?

What I simply want to do...


  • Before memory operation, lock.

  • After memory operation, unlock.

The code:

volatile int atomic_gate_memory = 0;

static inline void atomic_open(volatile int *gate)
{
    asm volatile (
        "wait:\n"
        "cmp %[lock], %[gate]\n"
        "je wait\n"
        "mov %[lock], %[gate]\n"
        : [gate] "=m" (*gate)
        : [lock] "r" (1)
    );
}

static inline void atomic_close(volatile int *gate)
{
    asm volatile (
        "mov %[lock], %[gate]\n"
        : [gate] "=m" (*gate)
        : [lock] "r" (0)
    );
}

Then something like this:

void *_malloc(size_t size)
{
        atomic_open(&atomic_gate_memory);
        void *mem = malloc(size);
        atomic_close(&atomic_gate_memory);
        return mem;
}
#define malloc(size) _malloc(size)

..and the same for calloc, realloc, free and fork (for Linux).

#ifdef _UNISTD_H
int _fork()
{
        pid_t pid;
        atomic_open(&atomic_gate_memory);
        pid = fork();
        atomic_close(&atomic_gate_memory);
        return pid;
}
#define fork() _fork()
#endif

After loading the stackframe for atomic_open, this is what objdump generates:

00000000004009a7 <wait>:
  4009a7:   39 10      cmp    %edx,(%rax)
  4009a9:   74 fc      je     4009a7 <wait>
  4009ab:   89 10      mov    %edx,(%rax)

Also, given the disassembly above: can I assume I'm making an atomic operation because it is only one instruction?


Solution

register is a meaningless hint in modern optimizing compilers.


I think a simple spinlock that doesn't have any of the really major / obvious performance problems on x86 is something like:

Implement as much of this as you like in inline asm, or preferably using C11 stdatomic, like this semaphore implementation.

;;;;; UNTESTED ;;;;;;;;
;void spin_lock  (volatile char *lock)
;void spin_unlock(volatile char *lock)

    ; first arg in rdi, in the AMD64 SysV ABI

global spin_unlock
spin_unlock:
    ;; debug: check that the old value was non-zero.  double-unlocking is a nasty bug
    mov   byte [rdi], 0
    ret


global spin_lock
spin_lock:
    cmp   byte [rdi], 0           ; avoid writing to the cache line if we don't own the lock: should speed up the other thread unlocking
    jnz   .spinloop

    mov   al, 1                   ; only need to do this the first time, otherwise we know al is non-zero
.retry:
    xchg  al, [rdi]

    test  al,al
    jnz   .spinloop
    ret                           ; no taken branches on the fast-path

.spinloop:
    pause                     ; very old CPUs decode it as REP NOP, which is fine
    cmp   byte [rdi], 0       ; To get a compiler to do this in C++11, use a memory_order_acquire load
    jnz   .spinloop
    jmp   .retry
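The C11 stdatomic route recommended above could be sketched roughly as follows. This is a minimal, illustrative translation of the asm listing, not code from the answer; `cpu_relax` is a helper name introduced here to wrap the PAUSE instruction:

```c
#include <stdatomic.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>           /* _mm_pause */
#define cpu_relax() _mm_pause()  /* the pause from the .spinloop above */
#else
#define cpu_relax() ((void)0)
#endif

void spin_lock(volatile atomic_char *lock)
{
    /* xchg-with-1, as in the asm version; acquire ordering on success */
    while (atomic_exchange_explicit(lock, 1, memory_order_acquire)) {
        /* read-only spin until the lock looks free, then retry the exchange,
           so we don't flood the interconnect with locked operations */
        while (atomic_load_explicit(lock, memory_order_relaxed))
            cpu_relax();
    }
}

void spin_unlock(volatile atomic_char *lock)
{
    /* a plain release store, like the mov byte [rdi], 0 */
    atomic_store_explicit(lock, 0, memory_order_release);
}
```

On x86, a compiler turns the exchange into `xchg` and the release store into a plain `mov`, matching the hand-written version.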


If you were using a bitfield of atomic flags, you could use lock bts (test and set) for the equivalent of xchg-with-1. You can spin on bt. To unlock, you'd need lock btr, not just btr, because it would be a non-atomic read-modify-write of the byte, or even the containing 32bits.
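For illustration, the `lock bts` / `lock btr` pattern can also be written portably with C11 fetch-ops; on x86, compilers can often compile a single-bit `fetch_or` test down to `lock bts`. The function names below are my own, not from the answer:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Try to claim one flag bit. Returns true if it was already set
   (i.e. the lock was busy). Compiles to lock bts + carry test on x86. */
static inline bool flag_test_and_set(atomic_uint *flags, unsigned bit)
{
    uint32_t mask = 1u << bit;
    return atomic_fetch_or_explicit(flags, mask, memory_order_acquire) & mask;
}

/* The lock btr equivalent: an atomic AND with the inverted mask,
   not a plain non-atomic read-modify-write of the containing word. */
static inline void flag_clear(atomic_uint *flags, unsigned bit)
{
    atomic_fetch_and_explicit(flags, ~(1u << bit), memory_order_release);
}
```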


This avoids writing to the lock if we see it's already locked. That way, the cache line containing the lock can stay in the "Owned" state (of MOESI) on the core running the thread that owns it. We also don't flood the CPU with locked operations in a loop. I'm not sure how much this slows things down in general, but 10 threads all waiting for the same spinlock will keep the memory arbitration hardware pretty busy. This might slow down the thread that does hold the lock, or other unrelated threads on the system, while they use other locks, or memory in general.

PAUSE is also essential, to avoid mis-speculation about memory ordering by the CPU. You exit the loop only when the memory you're reading was modified by another core. However, we don't want to pause in the un-contended case. On Skylake, PAUSE waits a lot longer, like ~100cycles IIRC, so you should definitely keep the spinloop separate from the initial check for unlocked.

I'm sure Intel's and AMD's optimization manuals talk about this, see the x86 tag wiki for that and tons of other links.

