Compiler chooses not to use the REP MOVSB instruction for a byte array move

Question

I'm checking the Release build of my project, built with the latest version of the VS 2017 C++ compiler, and I'm curious why the compiler chose to build the following code snippet:

//ncbSzBuffDataUsed of type INT32

UINT8* pDst = (UINT8*)(pMXB + 1);
UINT8* pSrc = (UINT8*)pDPE;
for(size_t i = 0; i < (size_t)ncbSzBuffDataUsed; i++)
{
    pDst[i] = pSrc[i];
}

as such:

        UINT8* pDst = (UINT8*)(pMXB + 1);
        UINT8* pSrc = (UINT8*)pDPE;
        for(size_t i = 0; i < (size_t)ncbSzBuffDataUsed; i++)
00007FF66441251E 4C 63 C2             movsxd      r8,edx  
00007FF664412521 4C 2B D1             sub         r10,rcx  
00007FF664412524 0F 1F 40 00          nop         dword ptr [rax]  
00007FF664412528 0F 1F 84 00 00 00 00 00 nop         dword ptr [rax+rax]  

00007FF664412530 41 0F B6 04 0A       movzx       eax,byte ptr [r10+rcx]  
        {
            pDst[i] = pSrc[i];
00007FF664412535 88 01                mov         byte ptr [rcx],al  
00007FF664412537 48 8D 49 01          lea         rcx,[rcx+1]  
00007FF66441253B 49 83 E8 01          sub         r8,1  
00007FF66441253F 75 EF                jne         _logDebugPrint_in_MainXchgBuffer+0A0h (07FF664412530h)  
        }

versus just using a single REP MOVSB instruction? Wouldn't the latter be more efficient?

Recommended answer

Edit: First up, there's an intrinsic for rep movsb, which Peter Cordes tells us would be much faster here, and I believe him (I guess I already did). If you want to force the compiler to do things this way, see __movsb(): https://docs.microsoft.com/en-us/cpp/intrinsics/movsb.
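
As a minimal sketch of what that could look like, the helper below wraps the copy from the question in a single function. The name copy_bytes_rep_movsb and the non-MSVC branches are assumptions added for illustration; only __movsb itself comes from the documentation linked above. On GCC or Clang for x86 the sketch falls back to inline assembly, and elsewhere to memcpy.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
#include <intrin.h>   // declares __movsb on MSVC
#endif

// Hypothetical helper (not from the original post): forward byte copy
// that asks for REP MOVSB where the toolchain offers a way to do so.
inline void copy_bytes_rep_movsb(std::uint8_t* dst,
                                 const std::uint8_t* src,
                                 std::size_t count)
{
    assert((dst != nullptr && src != nullptr) || count == 0);
#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
    // MSVC intrinsic: emits REP MOVSB with destination, source, count.
    __movsb(dst, src, count);
#elif (defined(__GNUC__) || defined(__clang__)) && \
      (defined(__x86_64__) || defined(__i386__))
    // Assumption: GCC/Clang extended asm. The constraints pin the operands
    // to the registers the instruction requires: D = (r/e)di destination,
    // S = (r/e)si source, c = (r/e)cx count.
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(count)
                     :
                     : "memory");
#else
    // Non-x86 targets: plain library copy.
    std::memcpy(dst, src, count);
#endif
}
```

With the variables from the question, a call might read copy_bytes_rep_movsb(pDst, pSrc, (size_t)ncbSzBuffDataUsed). Whether this actually beats the byte loop depends on the CPU and the buffer size, so it is worth measuring.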

As to why the compiler didn't do this for you, in the absence of any other ideas, the answer might be register pressure. To use rep movsb, the compiler would have to:

set up rsi (= source address)
set up rdi (= destination address)
set up rcx (= count)
issue the rep movsb

So now it has had to use up the three registers mandated by the rep movsb instruction, and it may prefer not to do that. Specifically, rsi and rdi are expected to be preserved across a function call (they are callee-saved in the Windows x64 calling convention, so a function that uses them must save and restore them), so if the compiler can get away without using them in the body of any particular function, it will. And rcx (on initial entry to a method, at least) holds the this pointer.

Also, with the code that the compiler has generated there, the r10 and rcx registers might already contain the requisite source and destination addresses (we can't see that from your example), which would be handy for the compiler if so.

In practice, you will probably see the compiler make different choices in different situations. The type of optimisation requested (/O1 to optimise for size, versus /O2 to optimise for speed) will likely also affect this.

More on the x64 register passing convention here, and on the x64 ABI generally here.

Edit 2 (again inspired by Peter's comments):

The compiler probably decided not to vectorise the loop because it doesn't know if the pointers are aligned or might overlap. Without seeing more of the code, we can't be sure. But that's not strictly relevant to my answer, given what the OP actually asked about.
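
For what it's worth, here is a rough sketch of two ways to give the compiler the no-overlap information it may be missing. The function names are hypothetical and not from the question. __restrict is a compiler extension accepted by MSVC, GCC and Clang rather than standard C++, and nothing here guarantees that a particular compiler will vectorise the loop or emit rep movsb.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: true if [a, a+n) and [b, b+n) share any byte.
inline bool ranges_overlap(const void* a, const void* b, std::size_t n)
{
    const auto x = reinterpret_cast<std::uintptr_t>(a);
    const auto y = reinterpret_cast<std::uintptr_t>(b);
    return x < y + n && y < x + n;
}

// Option 1 (assumption: the buffers never overlap). __restrict promises
// the compiler that dst and src do not alias, which removes one obstacle
// to vectorising the byte loop.
inline void copy_no_alias(std::uint8_t* __restrict dst,
                          const std::uint8_t* __restrict src,
                          std::size_t count)
{
    assert(!ranges_overlap(dst, src, count));
    for (std::size_t i = 0; i < count; ++i)
    {
        dst[i] = src[i];
    }
}

// Option 2: hand the job to memcpy, which also requires non-overlapping
// buffers and leaves the choice of copy strategy to the library/compiler.
inline void copy_with_memcpy(std::uint8_t* dst,
                             const std::uint8_t* src,
                             std::size_t count)
{
    assert(!ranges_overlap(dst, src, count));
    std::memcpy(dst, src, count);
}
```

If the buffers in the real code can overlap, neither option is safe as written; memmove would be the usual choice in that case.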