




I am currently looking into the malloc() implementation under Windows, but in my research I have stumbled upon a few things that puzzle me:


First, I know that at the API level, Windows mostly uses the HeapAlloc() and VirtualAlloc() calls to allocate memory. I gather from here that the Microsoft implementation of malloc() (the one included in the CRT, the C runtime) basically calls HeapAlloc() for blocks > 480 bytes and otherwise manages a special area allocated with VirtualAlloc() for small allocations, in order to prevent fragmentation.
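To make that split concrete, here is a rough, portable sketch of how I picture it. It is not the CRT's actual code: `sketch_malloc`, `sketch_free`, `small_alloc`, `in_arena` and `ARENA_SIZE` are hypothetical names, plain `malloc` stands in for HeapAlloc(), and a static array stands in for the region obtained with VirtualAlloc(). Only the 480-byte threshold comes from the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define SMALL_THRESHOLD 480          /* threshold quoted in the text */
#define ARENA_SIZE      (64 * 1024)  /* arbitrary size chosen for this sketch */

/* Stand-in for a region reserved with VirtualAlloc(); the union keeps it aligned. */
static union {
    max_align_t   align;
    unsigned char bytes[ARENA_SIZE];
} arena;
static size_t arena_used = 0;

/* Bump-allocate a small block from the arena; returns NULL once it is full. */
static void *small_alloc(size_t size)
{
    size_t unit = sizeof(max_align_t);
    size_t rounded = (size + unit - 1) / unit * unit;
    if (rounded == 0)
        rounded = unit;                       /* give zero-byte requests a distinct block */
    assert(arena_used % unit == 0);
    if (rounded > ARENA_SIZE - arena_used)
        return NULL;
    void *p = arena.bytes + arena_used;
    arena_used += rounded;
    return p;
}

/* Returns 1 if p points into the small-block arena, 0 otherwise. */
int in_arena(const void *p)
{
    uintptr_t a  = (uintptr_t)p;
    uintptr_t lo = (uintptr_t)arena.bytes;
    return a >= lo && a - lo < ARENA_SIZE;
}

/* Dispatch on size: large requests go to the general heap, small ones to the arena. */
void *sketch_malloc(size_t size)
{
    if (size > SMALL_THRESHOLD)
        return malloc(size);                  /* stand-in for HeapAlloc() */
    void *p = small_alloc(size);
    return p != NULL ? p : malloc(size);      /* fall back when the arena is full */
}

/* Release a block; this simple bump arena never reclaims individual small blocks. */
void sketch_free(void *p)
{
    if (p == NULL || in_arena(p))
        return;
    free(p);
}
```

A real small-block heap would of course reuse freed blocks instead of only bumping a pointer; the sketch is only meant to show the size-based dispatch.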


Well, that is all well and good. But then there are other implementations of malloc(), for instance nedmalloc, which claim to be up to 125% faster than Microsoft's malloc.


All this makes me wonder about a few things:


  1. Why can't we just call HeapAlloc() for small blocks? Does it perform poorly with regard to fragmentation (for example by doing "first-fit" instead of "best-fit")?

  • Actually, is there any way to know what is going on behind the various API allocation calls? That would be very helpful.


  2. What makes nedmalloc so much faster than Microsoft's malloc?

  3. From the above, I got the impression that HeapAlloc()/VirtualAlloc() are so slow that it is much faster for malloc() to call them only once in a while and then manage the allocated memory itself. Is that assumption true? Or is the malloc() "wrapper" only needed because of fragmentation? One would think that system calls like these would be quick, or at least that some thought would have been put into making them efficient.

  • If it is true, why is that?

  4. On average, how many (as an order of magnitude) memory reads/writes are performed by a typical malloc call (probably a function of the number of already allocated segments)? I would intuitively say it's in the tens for an average program. Am I right?


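  1. Calling HeapAlloc doesn't sound cross-platform. MS is free to change its implementation whenever it likes; I'd advise staying away from it. :)
  2. It probably makes more effective use of memory pools, much like the Loki library's "small object allocator".
  3. Heap allocations are general-purpose by nature, so they are always slow, whatever the implementation. The more "specialized" the allocator, the faster it will be. This brings us back to point 2, which is about memory pools (and allocation sizes tailored to your application).
  4. Don't know.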
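
To illustrate points 2 and 3, here is a minimal sketch of a fixed-size memory pool in C. It is an illustration of the general technique only, not the code of nedmalloc, Loki or the CRT; every name in it (`pool`, `pool_init`, `pool_alloc`, `pool_free`, `pool_destroy`) is hypothetical. The pool asks the general-purpose allocator for one big chunk up front, then serves and recycles equally sized blocks through a free list, so each allocation or release is just a couple of pointer updates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* A free block stores the link to the next free block inside itself. */
typedef union pool_node {
    union pool_node *next;   /* meaningful only while the block is free */
    max_align_t      align;  /* forces worst-case alignment for user data */
} pool_node;

typedef struct {
    void      *chunk;        /* the single large allocation backing the pool */
    pool_node *free_list;    /* singly linked list of currently free blocks */
    size_t     block_size;   /* per-block size after rounding up */
    size_t     capacity;     /* number of blocks in the pool */
} pool;

/* Create a pool of `capacity` blocks of at least `block_size` bytes. Returns 1 on success. */
int pool_init(pool *p, size_t block_size, size_t capacity)
{
    assert(p != NULL && block_size > 0 && capacity > 0);
    /* Round up so every block can hold a free-list node and stays aligned. */
    size_t unit = sizeof(pool_node);
    size_t rounded = (block_size + unit - 1) / unit * unit;
    if (capacity > SIZE_MAX / rounded)
        return 0;                              /* total size would overflow */
    p->chunk = malloc(rounded * capacity);     /* the only call to the general heap */
    if (p->chunk == NULL)
        return 0;
    p->block_size = rounded;
    p->capacity = capacity;
    p->free_list = NULL;
    /* Thread every block onto the free list, ending with block 0 at the head. */
    for (size_t i = capacity; i-- > 0; ) {
        pool_node *n = (pool_node *)((char *)p->chunk + i * rounded);
        n->next = p->free_list;
        p->free_list = n;
    }
    return 1;
}

/* Pop one block off the free list; returns NULL when the pool is exhausted. */
void *pool_alloc(pool *p)
{
    pool_node *n = p->free_list;
    if (n == NULL)
        return NULL;
    p->free_list = n->next;
    return n;
}

/* Push a block back onto the free list so the next pool_alloc can reuse it. */
void pool_free(pool *p, void *ptr)
{
    if (ptr == NULL)
        return;
    pool_node *n = ptr;
    n->next = p->free_list;
    p->free_list = n;
}

/* Return the whole chunk to the general heap. */
void pool_destroy(pool *p)
{
    free(p->chunk);
    p->chunk = NULL;
    p->free_list = NULL;
}
```

Typical use would be `pool_init(&p, sizeof(MyNode), 1024)` once, then `pool_alloc(&p)` / `pool_free(&p, ptr)` in the hot path (`MyNode` being whatever small type your application allocates a lot of). The trade-off is the one from point 3: the pool is fast precisely because it only handles one block size and never searches for a fit.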

