Question
I'm checking the Release build of my project, built with the latest version of the VS 2017 C++ compiler, and I'm curious why the compiler chose to compile the following code snippet:
//ncbSzBuffDataUsed of type INT32
UINT8* pDst = (UINT8*)(pMXB + 1);
UINT8* pSrc = (UINT8*)pDPE;
for(size_t i = 0; i < (size_t)ncbSzBuffDataUsed; i++)
{
    pDst[i] = pSrc[i];
}
like this:
UINT8* pDst = (UINT8*)(pMXB + 1);
UINT8* pSrc = (UINT8*)pDPE;
for(size_t i = 0; i < (size_t)ncbSzBuffDataUsed; i++)
00007FF66441251E 4C 63 C2 movsxd r8,edx
00007FF664412521 4C 2B D1 sub r10,rcx
00007FF664412524 0F 1F 40 00 nop dword ptr [rax]
00007FF664412528 0F 1F 84 00 00 00 00 00 nop dword ptr [rax+rax]
00007FF664412530 41 0F B6 04 0A movzx eax,byte ptr [r10+rcx]
{
pDst[i] = pSrc[i];
00007FF664412535 88 01 mov byte ptr [rcx],al
00007FF664412537 48 8D 49 01 lea rcx,[rcx+1]
00007FF66441253B 49 83 E8 01 sub r8,1
00007FF66441253F 75 EF jne _logDebugPrint_in_MainXchgBuffer+0A0h (07FF664412530h)
}
versus just using a single REP MOVSB instruction? Wouldn't the latter be more efficient?
Recommended answer
Edit: First up, there's an intrinsic for rep movsb, which Peter Cordes tells us would be much faster here, and I believe him (I guess I already did). If you want to force the compiler to do things this way, see __movsb(): https://docs.microsoft.com/en-us/cpp/intrinsics/movsb.
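As a rough illustration of what calling the intrinsic could look like, here is a minimal sketch. The helper name copy_bytes_movsb is hypothetical and not from the original post; on non-MSVC toolchains the sketch simply falls back to std::memcpy so that it still compiles there. Whether this is actually faster than the loop depends on the CPU and the copy size, so it is worth measuring.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
#include <intrin.h>  // declares __movsb on MSVC
#endif

// Hypothetical helper (not from the original post): copies `count` bytes
// from `src` to `dst`, using the rep movsb intrinsic where it is available.
inline void copy_bytes_movsb(unsigned char* dst, const unsigned char* src,
                             std::size_t count)
{
    if (count == 0)
        return;  // nothing to copy
    assert(dst != nullptr && src != nullptr);
#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
    __movsb(dst, src, count);      // MSVC intrinsic for rep movsb
#else
    std::memcpy(dst, src, count);  // portable fallback; assumes no overlap
#endif
}
```

In the question's code, the call would be something like copy_bytes_movsb(pDst, pSrc, (size_t)ncbSzBuffDataUsed), assuming ncbSzBuffDataUsed is never negative.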
As to why the compiler didn't do this for you, in the absence of any other ideas, the answer might be register pressure. To use rep movsb, the compiler would have to:
- set up rsi (= source address)
- set up rdi (= destination address)
- set up rcx (= count)
- issue the rep movsb
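The steps above can be made concrete with GCC/Clang-style inline assembly, where register constraints pin each operand to the register the instruction requires. This is only a sketch: the function name copy_bytes_rep_movsb is hypothetical, MSVC does not support inline assembly for x64 (there you would use the intrinsic instead), and the code assumes the direction flag is clear, as the calling conventions require on function entry.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not from the original post): forward byte copy
// that spells out the register setup rep movsb needs.
inline void copy_bytes_rep_movsb(void* dst, const void* src, std::size_t count)
{
    assert(count == 0 || (dst != nullptr && src != nullptr));
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    // "+D" places dst in rdi, "+S" places src in rsi, "+c" places count
    // in rcx. The instruction reads and updates all three, hence "+".
    __asm__ __volatile__("rep movsb"
                         : "+D"(dst), "+S"(src), "+c"(count)
                         :
                         : "memory");
#else
    // Fallback for other compilers/architectures: plain byte loop.
    unsigned char* d = static_cast<unsigned char*>(dst);
    const unsigned char* s = static_cast<const unsigned char*>(src);
    for (std::size_t i = 0; i < count; ++i)
        d[i] = s[i];
#endif
}
```

The three constraints are exactly the three registers listed above; the compiler has to free them up (and save rsi and rdi if the ABI requires it) before it can emit the instruction.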
So now it has had to use up the three registers mandated by the rep movsb instruction, and it may prefer not to do that. Specifically, rsi and rdi are expected to be preserved across a function call (they are callee-saved in the Windows x64 calling convention), so if the compiler can get away without using them in the body of a particular function, it will. And rcx holds the this pointer (on initial entry to the method, at least).
Also, in the code that the compiler has generated there, the r10 and rcx registers might already contain the requisite source and destination addresses (we can't see that from your example), which would be handy for the compiler if so.
In practice, you will probably see the compiler make different choices in different situations. The type of optimisation requested (/O1 - optimise for size, vs /O2 - optimise for speed) will likely also affect this.
More on the x64 register passing convention here, and on the x64 ABI generally here.
Edit 2 (again inspired by Peter's comments):
The compiler probably decided not to vectorise the loop because it doesn't know whether the pointers are aligned or might overlap. Without seeing more of the code, we can't be sure. But that's not strictly relevant to my answer, given what the OP actually asked about.
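If overlap is the concern, one way to tell the compiler more is sketched below. It assumes the two buffers in the question really never overlap, which we can't verify from the snippet. The function names are hypothetical; __restrict is a compiler extension accepted by MSVC, GCC and Clang. It does not say anything about alignment, and it does not guarantee that the loop gets vectorised or turned into rep movsb.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical variant of the question's loop: __restrict promises the
// compiler that dst and src never alias, removing one obstacle to
// vectorising the copy. Calling it with overlapping buffers is undefined.
inline void copy_no_overlap(std::uint8_t* __restrict dst,
                            const std::uint8_t* __restrict src,
                            std::size_t count)
{
    for (std::size_t i = 0; i < count; ++i)
        dst[i] = src[i];
}

// If the buffers might overlap, std::memmove is the safe choice.
inline void copy_may_overlap(std::uint8_t* dst, const std::uint8_t* src,
                             std::size_t count)
{
    if (count == 0)
        return;
    assert(dst != nullptr && src != nullptr);
    std::memmove(dst, src, count);
}
```

Comparing the generated assembly for the two functions at /O2 should show whether the no-alias promise changes what the compiler emits in your build.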