How to fill a 64-bit register with a repeated byte value

Question

I'm doing some x64 assembly with Visual C++ 2010 and MASM ('fast call' calling convention).

So let's say I have a function in C++:

extern "C" void fillArray(unsigned char* byteArray, unsigned char value);

The pointer to the array will be in RCX and the char value will be in DL.

How can I fill RAX using the value in DL such that, if I were to mov qword ptr [RCX], RAX and print byteArray, all the values would be equal to 'char value'?

Please note that I'm not trying to out-code my compiler, I'm just learning.

Recommended answer

Because you called your procedure 'fillArray', I assumed you want to fill a whole memory block with a byte value, so I compared several approaches. The code is 32-bit MASM, but the results should be similar in 64-bit mode. Each approach is tested with both aligned and unaligned buffers. Here are the results:

Simple REP STOSB - aligned....: 192
Simple REP STOSB - not aligned: 192
Simple REP STOSD - aligned....: 191
Simple REP STOSD - not aligned: 222
Simple while loop - aligned....: 267
Simple while loop - not aligned: 261
Simple while loop with different addressing - aligned....: 271
Simple while loop with different addressing - not aligned: 262
Loop with 16-byte SSE write - aligned....: 192
Loop with 16-byte SSE write - not aligned: 205
Loop with 16-byte SSE write non-temporal hint - aligned....: 126 (EDIT)

The most naive variant, using the following code, seems to perform best in both scenarios and also has the smallest code size:

cld
mov al, 44h   ; byte value
mov edi, lpDst
mov ecx, 256000*4  ; buf size
rep stosb

However, it is not the fastest for aligned data. I added a MOVNTDQ version, which performs best; see below.

For the sake of completeness, here are excerpts from the other routines. The value is assumed to have been expanded into EAX beforehand:

REP STOSD:

mov edi, lpDst
mov ecx, 256000
rep stosd

Simple while:

mov edi, lpDst
mov ecx, 256000
.while ecx>0
    mov [edi],eax
    add edi,4
    dec ecx
.endw

Different simple while:

mov edi, lpDst
xor ecx, ecx
.while ecx<256000 
    mov [edi+ecx*4],eax
    inc ecx
.endw

SSE (both):

movd xmm0,eax
punpckldq xmm0,xmm0    ; xxxxxxxxGGGGHHHH -> xxxxxxxxHHHHHHHH
punpcklqdq xmm0,xmm0   ; xxxxxxxxHHHHHHHH -> HHHHHHHHHHHHHHHH
mov ecx, 256000/4   ; 16 byte
mov edi, lpDst
.while ecx>0 
    movdqa xmmword ptr [edi],xmm0    ; movdqu for unaligned
    add edi,16
    dec ecx
.endw

SSE (NT, aligned, EDIT):

movd xmm0,eax
punpckldq xmm0,xmm0    ; xxxxxxxxGGGGHHHH -> xxxxxxxxHHHHHHHH
punpcklqdq xmm0,xmm0   ; xxxxxxxxHHHHHHHH -> HHHHHHHHHHHHHHHH
mov ecx, 256000/4   ; 16 byte
mov edi, lpDst
.while ecx>0 
    movntdq xmmword ptr [edi],xmm0
    add edi,16
    dec ecx
.endw
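For readers who would rather try the two SSE loops from C++ first, here is a minimal sketch using SSE2 intrinsics. It is not the benchmarked code: the function name fill_sse2 and its non_temporal flag are made up for illustration, and it assumes an x86 target, a 16-byte aligned destination and a byte count that is a multiple of 16.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical helper (not from the original answer): fill `count` bytes at
// `dst` with `value` using 16-byte stores.
// Assumes `dst` is 16-byte aligned and `count` is a multiple of 16.
void fill_sse2(unsigned char* dst, std::size_t count, unsigned char value,
               bool non_temporal) {
    assert(reinterpret_cast<std::uintptr_t>(dst) % 16 == 0);
    assert(count % 16 == 0);

    // Broadcast the byte to all 16 lanes; this replaces the movd/punpck chain.
    const __m128i pattern = _mm_set1_epi8(static_cast<char>(value));

    for (std::size_t i = 0; i < count; i += 16) {
        __m128i* p = reinterpret_cast<__m128i*>(dst + i);
        if (non_temporal) {
            _mm_stream_si128(p, pattern);  // MOVNTDQ: store with non-temporal hint
        } else {
            _mm_store_si128(p, pattern);   // MOVDQA: ordinary aligned store
        }
    }
    if (non_temporal) {
        _mm_sfence();  // order the streaming stores before later stores
    }
}
```

For an unaligned destination, _mm_storeu_si128 would take the place of _mm_store_si128, matching the movdqu note in the MASM excerpt; the non-temporal store has no unaligned form.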

I uploaded the whole code here: http://pastie.org/9831404. The MASM package from hutch is required for assembling it.

If SSSE3 is available, you can use pshufb to broadcast a byte to all positions of a register instead of a chain of punpck instructions:

movd    xmm0, edx
xorps   xmm1,xmm1      ; xmm1 = 0
pshufb  xmm0, xmm1     ; xmm0 = _mm_set1_epi8(dl)
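Coming back to the general-purpose register in the question: one common way to replicate a byte across a 64-bit value is to multiply it by 0x0101010101010101. The C++ sketch below illustrates that idea only; the names broadcast_byte and fill_array_qwords are hypothetical and nothing here was part of the benchmark above. In assembly the same idea would amount to zero-extending DL into RAX and multiplying by that constant.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Replicate one byte into all eight bytes of a 64-bit value.
// 0x0101010101010101 has a 1 in every byte, so the product holds a copy of
// `value` in each byte; no carries occur because value < 256.
constexpr std::uint64_t broadcast_byte(unsigned char value) {
    return static_cast<std::uint64_t>(value) * 0x0101010101010101ULL;
}

// Hypothetical helper: fill `count` bytes at `dst`, eight at a time where
// possible, then finish the tail one byte at a time.
void fill_array_qwords(unsigned char* dst, std::size_t count, unsigned char value) {
    assert(dst != nullptr || count == 0);
    const std::uint64_t pattern = broadcast_byte(value);
    std::size_t i = 0;
    for (; i + 8 <= count; i += 8) {
        // memcpy avoids alignment and aliasing problems; optimizing compilers
        // typically turn it into a single 8-byte store.
        std::memcpy(dst + i, &pattern, sizeof pattern);
    }
    for (; i < count; ++i) {
        dst[i] = value;
    }
}
```

For example, broadcast_byte(0x44) yields 0x4444444444444444, which is the value RAX would need to hold before the mov qword ptr [RCX], RAX in the question.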


09-17 16:13