Problem description
I learned that memset(ptr, 0, nbytes) is really fast, but is there a faster way (at least on x86)?
I assume that memset uses mov; however, when zeroing memory most compilers use xor because it's faster, correct? edit1: Wrong. As GregS pointed out, that only works with registers. What was I thinking?
I also asked someone who knows more assembler than I do to look at the stdlib, and he told me that on x86 memset does not take full advantage of the 32-bit-wide registers. However, I was very tired at the time, so I'm not quite sure I understood him correctly.
edit2: I revisited this issue and did a little testing. Here is what I tested:
#include <stdio.h>
#include <stdlib.h>   /* malloc, free (standard header, instead of <malloc.h>) */
#include <string.h>
#include <sys/time.h>

/* Run body once and print the elapsed wall-clock time in milliseconds. */
#define TIME(body) do { \
    struct timeval t1, t2; double elapsed; \
    gettimeofday(&t1, NULL); \
    body \
    gettimeofday(&t2, NULL); \
    elapsed = (t2.tv_sec - t1.tv_sec) * 1000.0 + (t2.tv_usec - t1.tv_usec) / 1000.0; \
    printf("%s\n --- %f ---\n", #body, elapsed); } while(0)

#define SIZE 0x1000000

/* Zero the buffer one byte at a time. */
void zero_1(void* buff, size_t size)
{
    size_t i;
    char* foo = buff;

    for (i = 0; i < size; i++)
        foo[i] = 0;
}

/* I foolishly assume size_t has register width */
void zero_sizet(void* buff, size_t size)
{
    size_t i;
    char* bar;
    size_t* foo = buff;

    for (i = 0; i < size / sizeof(size_t); i++)
        foo[i] = 0;

    // fixes bug pointed out by tristopia
    bar = (char*)buff + size - size % sizeof(size_t);
    for (i = 0; i < size % sizeof(size_t); i++)
        bar[i] = 0;
}

int main(void)
{
    char* buffer = malloc(SIZE);

    if (buffer == NULL) {
        perror("malloc");
        return 1;
    }

    TIME(
        memset(buffer, 0, SIZE);
    );
    TIME(
        zero_1(buffer, SIZE);
    );
    TIME(
        zero_sizet(buffer, SIZE);
    );

    free(buffer);
    return 0;
}
Results:
For now memset wins; previous results were distorted by the CPU cache. (All tests were run on Linux.) Further testing is needed. I'll try assembler next :)
edit3: Fixed a bug in the test code; the test results are not affected.
edit4: While poking around the disassembled VS2010 C runtime, I noticed that memset has an SSE-optimized routine for zeroing. It will be hard to beat this.
Recommended answer
x86 covers a rather broad range of devices.
For a totally generic x86 target, an assembly block with "rep stosd" can blast zeros out to memory 32 bits at a time. Try to make sure the bulk of this work is DWORD aligned.
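A minimal sketch of that idea, assuming GCC or Clang inline assembly on an x86 target (any other toolchain falls back to memset). It uses the store-string instruction, spelled stosl in AT&T syntax, which writes EAX to memory repeatedly. The function name zero_rep_stos and the HAVE_X86_GNU_ASM macro are made up for this illustration and are not part of any library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
#define HAVE_X86_GNU_ASM 1   /* hypothetical feature macro for this sketch */
#endif

/* Hypothetical helper: zero size bytes at buff with rep stos. */
void zero_rep_stos(void *buff, size_t size)
{
    assert(buff != NULL || size == 0);
#ifdef HAVE_X86_GNU_ASM
    void *dst = buff;

    /* Bytes needed to reach the next 4-byte boundary (0..3). */
    size_t head = (size_t)(-(uintptr_t)buff & 3u);
    if (head > size)
        head = size;
    size_t dwords = (size - head) / 4;   /* aligned 32-bit stores */
    size_t tail = (size - head) % 4;     /* leftover bytes */

    /* Each statement stores AL/EAX (zero) at [EDI/RDI], ECX/RCX times,
       advancing the destination pointer as it goes. */
    __asm__ volatile("rep stosb" : "+D"(dst), "+c"(head) : "a"(0) : "memory");
    __asm__ volatile("rep stosl" : "+D"(dst), "+c"(dwords) : "a"(0) : "memory");
    __asm__ volatile("rep stosb" : "+D"(dst), "+c"(tail) : "a"(0) : "memory");
#else
    memset(buff, 0, size);   /* portable fallback */
#endif
}
```

Called as zero_rep_stos(buffer, SIZE), it could be timed next to memset in the question's test program. Whether it wins depends on the CPU and the buffer size.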
For chips with MMX, an assembly loop with movq can hit 64 bits at a time.
You might be able to get a C/C++ compiler to use a 64-bit write through a pointer to a long long or __m64. The target must be 8-byte aligned for the best performance.
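As a rough illustration of the 64-bit-write approach in plain C, here is a sketch that uses uint64_t (standing in for long long) and byte stores for the unaligned edges. The name zero_u64 is hypothetical, and the sketch assumes the buffer's storage may legitimately be written as uint64_t, as is the case for malloc'd memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: zero size bytes at buff, using 64-bit stores
   for the aligned middle and byte stores for the edges. */
void zero_u64(void *buff, size_t size)
{
    unsigned char *p = buff;

    assert(buff != NULL || size == 0);

    /* Head: byte stores until p is 8-byte aligned. */
    while (size > 0 && ((uintptr_t)p % sizeof(uint64_t)) != 0) {
        *p++ = 0;
        size--;
    }

    /* Body: one aligned 64-bit store per iteration. */
    uint64_t *q = (uint64_t *)(void *)p;
    while (size >= sizeof(uint64_t)) {
        *q++ = 0;
        size -= sizeof(uint64_t);
    }

    /* Tail: the remaining 0..7 bytes. */
    p = (unsigned char *)q;
    while (size > 0) {
        *p++ = 0;
        size--;
    }
}
```

Note that an optimizing compiler may recognize loops like these and replace them with a call to memset, so it is worth checking the generated assembly before drawing conclusions from a benchmark.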
For chips with SSE, movaps is fast, but it requires a 16-byte-aligned address, so use stosb until the address is aligned, and then complete the clear with a loop of movaps.
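The align-then-movaps scheme can be sketched with compiler intrinsics instead of hand-written assembly. This assumes a compiler that defines __SSE__ and ships xmmintrin.h (GCC and Clang do on SSE-capable targets); elsewhere the sketch falls back to memset. The name zero_sse is made up for the example. _mm_store_ps is the aligned 16-byte store, which compilers typically emit as movaps.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Hypothetical helper: zero size bytes at buff with aligned SSE stores. */
void zero_sse(void *buff, size_t size)
{
    assert(buff != NULL || size == 0);
#if defined(__SSE__)
    unsigned char *p = buff;

    /* Head: byte stores until p is 16-byte aligned. */
    while (size > 0 && ((uintptr_t)p % 16) != 0) {
        *p++ = 0;
        size--;
    }

    /* Body: one aligned 16-byte store per iteration. */
    const __m128 zero = _mm_setzero_ps();
    while (size >= 16) {
        _mm_store_ps((float *)(void *)p, zero);
        p += 16;
        size -= 16;
    }

    /* Tail: the remaining 0..15 bytes. */
    while (size > 0) {
        *p++ = 0;
        size--;
    }
#else
    memset(buff, 0, size);   /* portable fallback */
#endif
}
```

For buffers much larger than the cache, non-temporal stores are sometimes tried in place of the aligned store. Whether that helps is something to measure rather than assume.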
Win32 has ZeroMemory(), but I forget whether that's a macro for memset or an actual 'good' implementation.