Question
I am doing image processing in C that requires copying large chunks of data around in memory. The source and destination never overlap.
What is the absolute fastest way to do this on the x86 platform using GCC (where SSE and SSE2, but NOT SSE3, are available)?
I expect the solution will be either in assembly or using GCC intrinsics.
I found the following link, but have no idea whether it's the best way to go about it (the author also says it has a few bugs): http://coding.derkeiler.com/Archive/Assembler/comp.lang.asm.x86/2006-02/msg00123.html
Note that a copy is necessary; I cannot get around having to copy the data (I could explain why, but I'll spare you the explanation :))
Recommended answer
Courtesy of William Chan and Google. 30-70% faster than memcpy in Microsoft Visual Studio 2005.
// MSVC inline assembly (32-bit). Requires dest and src to be 16-byte aligned.
// Copies whole 128-byte blocks only; any tail of size % 128 bytes is not copied.
void X_aligned_memcpy_sse2(void* dest, const void* src, const unsigned long size)
{
    __asm
    {
        mov esi, src;    //src pointer
        mov edi, dest;   //dest pointer
        mov ebx, size;   //ebx is our counter
        shr ebx, 7;      //divide by 128 (8 * 128bit registers)
        jz loop_copy_end; //fewer than 128 bytes: skip the loop instead of wrapping the counter

    loop_copy:
        prefetchnta 128[ESI]; //SSE non-temporal prefetch of the next block
        prefetchnta 160[ESI];
        prefetchnta 192[ESI];
        prefetchnta 224[ESI];

        movdqa xmm0, 0[ESI];  //move data from src to registers
        movdqa xmm1, 16[ESI];
        movdqa xmm2, 32[ESI];
        movdqa xmm3, 48[ESI];
        movdqa xmm4, 64[ESI];
        movdqa xmm5, 80[ESI];
        movdqa xmm6, 96[ESI];
        movdqa xmm7, 112[ESI];

        movntdq 0[EDI], xmm0; //move data from registers to dest, bypassing the cache
        movntdq 16[EDI], xmm1;
        movntdq 32[EDI], xmm2;
        movntdq 48[EDI], xmm3;
        movntdq 64[EDI], xmm4;
        movntdq 80[EDI], xmm5;
        movntdq 96[EDI], xmm6;
        movntdq 112[EDI], xmm7;

        add esi, 128;
        add edi, 128;
        dec ebx;
        jnz loop_copy;   //loop please

    loop_copy_end:
        sfence;          //order the non-temporal stores before any later stores
    }
}
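The routine above uses MSVC inline-assembly syntax, which GCC does not accept. Since the question asks about GCC, here is a minimal sketch of the same idea written with SSE2 intrinsics from <emmintrin.h>. It is not the original author's code: the function name aligned_memcpy_sse2_sketch is hypothetical, the prefetch distance (one 128-byte block ahead, two 64-byte cache lines) is an assumption, and the tail handling via memcpy is an addition. It assumes both pointers are 16-byte aligned and the buffers do not overlap.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: _mm_load_si128, _mm_stream_si128 (pulls in xmmintrin.h) */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not from the original answer.
 * Assumes: dest and src are 16-byte aligned and do not overlap. */
void aligned_memcpy_sse2_sketch(void *dest, const void *src, size_t size)
{
    assert(((uintptr_t)dest & 15) == 0 && ((uintptr_t)src & 15) == 0);

    const __m128i *s = (const __m128i *)src;
    __m128i *d = (__m128i *)dest;
    size_t blocks = size / 128;   /* 8 x 16-byte registers per iteration */
    size_t tail = size % 128;

    while (blocks > 0) {
        blocks--;
        if (blocks > 0) {
            /* Prefetch the next 128-byte block (assumed 64-byte cache lines). */
            _mm_prefetch((const char *)(s + 8), _MM_HINT_NTA);
            _mm_prefetch((const char *)(s + 12), _MM_HINT_NTA);
        }

        /* Aligned loads from the source. */
        __m128i r0 = _mm_load_si128(s + 0);
        __m128i r1 = _mm_load_si128(s + 1);
        __m128i r2 = _mm_load_si128(s + 2);
        __m128i r3 = _mm_load_si128(s + 3);
        __m128i r4 = _mm_load_si128(s + 4);
        __m128i r5 = _mm_load_si128(s + 5);
        __m128i r6 = _mm_load_si128(s + 6);
        __m128i r7 = _mm_load_si128(s + 7);

        /* Non-temporal (cache-bypassing) stores to the destination. */
        _mm_stream_si128(d + 0, r0);
        _mm_stream_si128(d + 1, r1);
        _mm_stream_si128(d + 2, r2);
        _mm_stream_si128(d + 3, r3);
        _mm_stream_si128(d + 4, r4);
        _mm_stream_si128(d + 5, r5);
        _mm_stream_si128(d + 6, r6);
        _mm_stream_si128(d + 7, r7);

        s += 8;
        d += 8;
    }

    _mm_sfence();                 /* order the streaming stores before later stores */

    if (tail > 0) {
        memcpy(d, s, tail);       /* remaining bytes that do not fill a 128-byte block */
    }
}
```

On 32-bit x86 targets this needs to be built with -msse2 (for example gcc -O2 -msse2); on x86-64, SSE2 is enabled by default. Non-temporal stores tend to help only when the copy is much larger than the cache and the destination is not read again soon, so it is worth timing this against plain memcpy on the actual image sizes before adopting it.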
You may be able to optimize it further depending on your exact situation and any assumptions you are able to make.
You may also want to check out the memcpy source (memcpy.asm) and strip out its special-case handling. It may be possible to optimize further!