This article presents the question "Can multithreading emphasize memory fragmentation?" together with its accepted answer. It should be a useful reference for anyone who runs into the same symptom.
Description

When allocating and deallocating randomly sized memory chunks with 4 or more threads using OpenMP's parallel for construct, the program seems to start leaking considerable amounts of memory in the second half of the test program's runtime. It thus grows from 1050 MB to 1500 MB of consumed memory or more without actually making use of the extra memory.

As valgrind shows no issues, I must assume that what appears to be a memory leak is actually an emphasized effect of memory fragmentation.

Interestingly, the effect does not show yet if 2 threads make 10000 allocations each, but it shows strongly if 4 threads make 5000 allocations each. Also, if the maximum size of the allocated chunks is reduced to 256 kB (from 1 MB), the effect gets weaker.

Can heavy concurrency emphasize fragmentation that much? Or is this more likely to be a bug in the heap?
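Editorial aside (not part of the original question): on glibc, one quick way to tell "free memory the allocator is still holding" from a genuine leak is to query mallinfo() and then call malloc_trim(). If the resident set drops noticeably after the trim, the growth was fragmentation or allocator caching rather than leaked blocks. A minimal sketch, assuming glibc's <malloc.h>; note that mallinfo()'s fields are plain int and can wrap around on multi-gigabyte heaps:

#include <malloc.h>
#include <iostream>

// Print how much memory the program is really using versus how much the
// glibc allocator is merely holding on to, then try to give the free part
// back to the OS.
void reportHeapState(const char* tag)
{
    struct mallinfo mi = mallinfo();
    std::cout << tag
              << ": arena="   << mi.arena     // bytes obtained via sbrk
              << " hblkhd="   << mi.hblkhd    // bytes obtained via mmap
              << " uordblks=" << mi.uordblks  // bytes currently in use
              << " fordblks=" << mi.fordblks  // free bytes held by the allocator
              << std::endl;

    // Release free memory at the top of the heap back to the kernel; a large
    // RSS drop here points at fragmentation/caching rather than a leak.
    malloc_trim(0);
}

Calling a helper like this right after each thread's final release phase would show whether fordblks (allocator-held free space) accounts for the extra ~450 MB seen in top.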
Test Program Description

The demo program is built to obtain a total of 256 MB of randomly sized memory chunks from the heap, doing 5000 allocations. If the memory limit is hit, the chunks allocated first are deallocated until the memory consumption falls below the limit. Once 5000 allocations have been performed, all memory is released and the loop ends. All this work is done for each thread generated by OpenMP.

This memory allocation scheme allows us to expect a memory consumption of about 260 MB per thread (including some bookkeeping data).

Demo Program

As this is really something you might want to test, you can download the sample program with a simple makefile from dropbox.

When running the program as is, you should have at least 1400 MB of RAM available. Feel free to adjust the constants in the code to suit your needs.

For completeness, the actual code follows:

#include <stdlib.h>
#include <stdio.h>
#include <iostream>
#include <vector>
#include <deque>
#include <omp.h>
#include <math.h>

typedef unsigned long long uint64_t;

void runParallelAllocTest()
{
    // constants
    const int NUM_ALLOCATIONS = 5000;  // alloc's per thread
    const int NUM_THREADS = 4;         // how many threads?
    const int NUM_ITERS = NUM_THREADS; // how many overall repetitions
    const bool USE_NEW = true;         // use new or malloc? seems to make no difference (as it should)
    const bool DEBUG_ALLOCS = false;   // debug output

    // pre-store allocation sizes
    const int NUM_PRE_ALLOCS = 20000;
    const uint64_t MEM_LIMIT = (1024 * 1024) * 256;  // x MB per process
    const size_t MAX_CHUNK_SIZE = 1024 * 1024 * 1;

    srand(1);
    std::vector<size_t> allocations;
    allocations.resize(NUM_PRE_ALLOCS);
    for (int i = 0; i < NUM_PRE_ALLOCS; i++) {
        allocations[i] = rand() % MAX_CHUNK_SIZE;    // use up to x MB chunks
    }

    #pragma omp parallel num_threads(NUM_THREADS)
    #pragma omp for
    for (int i = 0; i < NUM_ITERS; ++i) {
        uint64_t totalAllocBytes = 0;
        uint64_t currAllocBytes = 0;

        std::deque< std::pair<char*, uint64_t> > pointers;
        const int myId = omp_get_thread_num();

        for (int j = 0; j < NUM_ALLOCATIONS; ++j) {
            // new allocation
            const size_t allocSize = allocations[(myId * 100 + j) % NUM_PRE_ALLOCS];

            char* pnt = NULL;
            if (USE_NEW) {
                pnt = new char[allocSize];
            } else {
                pnt = (char*) malloc(allocSize);
            }
            pointers.push_back(std::make_pair(pnt, allocSize));
            totalAllocBytes += allocSize;
            currAllocBytes  += allocSize;

            // fill with values to add "delay"
            for (int fill = 0; fill < (int) allocSize; ++fill) {
                pnt[fill] = (char)(j % 255);
            }

            if (DEBUG_ALLOCS) {
                std::cout << "Id " << myId << " New alloc " << pointers.size()
                          << ", bytes:" << allocSize << " at " << (uint64_t) pnt << "\n";
            }

            // free all or just a bit
            if (((j % 5) == 0) || (j == (NUM_ALLOCATIONS - 1))) {
                int frees = 0;

                // keep this much allocated
                // last check, free all
                uint64_t memLimit = MEM_LIMIT;
                if (j == NUM_ALLOCATIONS - 1) {
                    std::cout << "Id " << myId << " about to release all memory: "
                              << (currAllocBytes / (double)(1024 * 1024)) << " MB" << std::endl;
                    memLimit = 0;
                }
                //MEM_LIMIT = 0; // DEBUG

                while (pointers.size() > 0 && (currAllocBytes > memLimit)) {
                    // free one of the first entries to allow previously obtained resources to 'live' longer
                    currAllocBytes -= pointers.front().second;
                    char* pnt = pointers.front().first;

                    // free memory
                    if (USE_NEW) {
                        delete[] pnt;
                    } else {
                        free(pnt);
                    }

                    // update array
                    pointers.pop_front();

                    if (DEBUG_ALLOCS) {
                        std::cout << "Id " << myId << " Free'd " << pointers.size()
                                  << " at " << (uint64_t) pnt << "\n";
                    }
                    frees++;
                }
                if (DEBUG_ALLOCS) {
                    std::cout << "Frees " << frees << ", " << currAllocBytes << "/"
                              << MEM_LIMIT << ", " << totalAllocBytes << "\n";
                }
            }
        } // for each allocation

        if (currAllocBytes != 0) {
            std::cerr << "Not all free'd!\n";
        }

        std::cout << "Id " << myId << " done, total alloc'ed "
                  << ((double) totalAllocBytes / (double)(1024 * 1024)) << "MB \n";
    } // for each iteration

    exit(1);
}

int main(int argc, char** argv)
{
    runParallelAllocTest();
    return 0;
}

The Test-System

From what I have seen so far, the hardware matters a lot. The test might need adjustments if run on a faster machine.

Intel(R) Core(TM)2 Duo CPU T7300 @ 2.00GHz
Ubuntu 10.04 LTS 64 bit
gcc 4.3, 4.4, 4.6
3988.62 Bogomips

Testing

Once you have executed the makefile, you should get a file named ompmemtest. To query the memory usage over time, I used the following commands:

./ompmemtest &
top -b | grep ompmemtest

This yields the quite impressive fragmentation or leaking behaviour. The expected memory consumption with 4 threads is 1090 MB, which became 1500 MB over time:

PID   USER   PR  NI  VIRT   RES  SHR S %CPU %MEM    TIME+ COMMAND
11626 byron  20   0  204m   99m 1000 R   27  2.5  0:00.81 ompmemtest
11626 byron  20   0  992m  832m 1004 R  195 21.0  0:06.69 ompmemtest
11626 byron  20   0 1118m  1.0g 1004 R  189 26.1  0:12.40 ompmemtest
11626 byron  20   0 1218m  1.0g 1004 R  190 27.1  0:18.13 ompmemtest
11626 byron  20   0 1282m  1.1g 1004 R  195 29.6  0:24.06 ompmemtest
11626 byron  20   0 1471m  1.3g 1004 R  195 33.5  0:29.96 ompmemtest
11626 byron  20   0 1469m  1.3g 1004 R  194 33.5  0:35.85 ompmemtest
11626 byron  20   0 1469m  1.3g 1004 R  195 33.6  0:41.75 ompmemtest
11626 byron  20   0 1636m  1.5g 1004 R  194 37.8  0:47.62 ompmemtest
11626 byron  20   0 1660m  1.5g 1004 R  195 38.0  0:53.54 ompmemtest
11626 byron  20   0 1669m  1.5g 1004 R  195 38.2  0:59.45 ompmemtest
11626 byron  20   0 1664m  1.5g 1004 R  194 38.1  1:05.32 ompmemtest
11626 byron  20   0 1724m  1.5g 1004 R  195 40.0  1:11.21 ompmemtest
11626 byron  20   0 1724m  1.6g 1140 S  193 40.1  1:17.07 ompmemtest

Please note: I could reproduce this issue when compiling with gcc 4.3, 4.4 and 4.6 (trunk).
Here is the simultaneous output of vmstat -S M 1Vmstat raw dataprocs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- 0 0 0 3892 364 2669 0 0 24 0 701 1487 2 1 97 0 4 0 0 3421 364 2669 0 0 0 0 1317 1953 53 7 40 0 4 0 0 2858 364 2669 0 0 0 0 2715 5030 79 16 5 0 4 0 0 2861 364 2669 0 0 0 0 6164 12637 76 15 9 0 4 0 0 2853 364 2669 0 0 0 0 4845 8617 77 13 10 0 4 0 0 2848 364 2669 0 0 0 0 3782 7084 79 13 8 0 5 0 0 2842 364 2669 0 0 0 0 3723 6120 81 12 7 0 4 0 0 2835 364 2669 0 0 0 0 3477 4943 84 9 7 0 4 0 0 2834 364 2669 0 0 0 0 3273 4950 81 10 9 0 5 0 0 2828 364 2669 0 0 0 0 3226 4812 84 11 6 0 4 0 0 2823 364 2669 0 0 0 0 3250 4889 83 10 7 0 4 0 0 2826 364 2669 0 0 0 0 3023 4353 85 10 6 0 4 0 0 2817 364 2669 0 0 0 0 3176 4284 83 10 7 0 4 0 0 2823 364 2669 0 0 0 0 3008 4063 84 10 6 0 0 0 0 3893 364 2669 0 0 0 0 4023 4228 64 10 26 0Does that information mean anything to you?Google Thread Caching MallocNow for real fun, add a little spicetime LD_PRELOAD="/usr/lib/libtcmalloc.so" ./ompmemtestId 1 about to release all memory: 257.339 MBId 1 done, total alloc'ed -1570.42MBId 3 about to release all memory: 257.854 MBId 3 done, total alloc'ed -1569.6MBId 2 about to release all memory: 257.043 MBId 2 done, total alloc'ed -1569.96MBId 0 about to release all memory: 258.144 MBId 0 done, total alloc'ed -1572.7MBreal 0m11.663suser 0m44.255ssys 0m1.028sLooks faster, not?procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- 4 0 0 3562 364 2684 0 0 0 0 1041 1676 28 7 64 0 4 2 0 2806 364 2684 0 0 0 172 1641 1843 84 14 1 0 4 0 0 2758 364 2685 0 0 0 0 1520 1009 98 2 1 0 4 0 0 2747 364 2685 0 0 0 0 1504 859 98 2 0 0 5 0 0 2745 364 2685 0 0 0 0 1575 1073 98 2 0 0 5 0 0 2739 364 2685 0 0 0 0 1415 743 99 1 0 0 4 0 0 2738 364 2685 0 0 0 0 1526 981 99 2 0 0 4 0 0 2731 364 2685 0 0 0 684 1536 927 98 2 0 0 4 0 0 2730 364 2685 0 0 0 0 1584 1010 99 1 0 0 5 0 0 2730 364 2685 0 0 0 0 1461 917 99 2 0 0 4 0 0 2729 364 2685 0 0 0 0 1561 1036 99 1 0 0 4 0 0 2729 364 2685 0 0 0 0 1406 756 100 1 0 0 0 0 0 3819 364 2685 0 0 0 4 1159 1476 26 3 71 0In case you wanted to compare vmstat outputsValgrind --tool massifThis is the head of output from ms_print after valgrind --tool=massif ./ompmemtest (default malloc):--------------------------------------------------------------------------------Command: ./ompmemtestMassif arguments: (none)ms_print arguments: massif.out.beforetcmalloc-------------------------------------------------------------------------------- GB1.009^ : | ##::::@@:::::::@@::::::@@::::@@::@::::@::::@:::::::::@::::::@::: | # :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::: | # :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::: | :# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::: | :# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::: | :# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@:::: | ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@:::: | ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@:::: | ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@:::: | ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@:::: | ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@:::: | ::::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@:::: | : ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@:::: | : ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@:::: | :: ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: 
::@::::::@:::: | :: ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@:::: | ::: ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@:::: | ::: ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@:::: | ::: ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@:::: 0 +----------------------------------------------------------------------->Gi 0 264.0Number of snapshots: 63 Detailed snapshots: [6 (peak), 10, 17, 23, 27, 30, 35, 39, 48, 56]Google HEAPPROFILEUnfortunately, vanilla valgrind doesn't work with tcmalloc, so I switched horses midrace to heap profiling with google-perftoolsgcc openMpMemtest_Linux.cpp -fopenmp -lgomp -lstdc++ -ltcmalloc -o ompmemtesttime HEAPPROFILE=/tmp/heapprofile ./ompmemtestStarting tracking the heapDumping heap profile to /tmp/heapprofile.0001.heap (100 MB currently in use)Dumping heap profile to /tmp/heapprofile.0002.heap (200 MB currently in use)Dumping heap profile to /tmp/heapprofile.0003.heap (300 MB currently in use)Dumping heap profile to /tmp/heapprofile.0004.heap (400 MB currently in use)Dumping heap profile to /tmp/heapprofile.0005.heap (501 MB currently in use)Dumping heap profile to /tmp/heapprofile.0006.heap (601 MB currently in use)Dumping heap profile to /tmp/heapprofile.0007.heap (701 MB currently in use)Dumping heap profile to /tmp/heapprofile.0008.heap (801 MB currently in use)Dumping heap profile to /tmp/heapprofile.0009.heap (902 MB currently in use)Dumping heap profile to /tmp/heapprofile.0010.heap (1002 MB currently in use)Dumping heap profile to /tmp/heapprofile.0011.heap (2029 MB allocated cumulatively, 1031 MB currently in use)Dumping heap profile to /tmp/heapprofile.0012.heap (3053 MB allocated cumulatively, 1030 MB currently in use)Dumping heap profile to /tmp/heapprofile.0013.heap (4078 MB allocated cumulatively, 1031 MB currently in use)Dumping heap profile to /tmp/heapprofile.0014.heap (5102 MB allocated cumulatively, 1031 MB currently in use)Dumping heap profile to /tmp/heapprofile.0015.heap (6126 MB allocated cumulatively, 1033 MB currently in use)Dumping heap profile to /tmp/heapprofile.0016.heap (7151 MB allocated cumulatively, 1029 MB currently in use)Dumping heap profile to /tmp/heapprofile.0017.heap (8175 MB allocated cumulatively, 1029 MB currently in use)Dumping heap profile to /tmp/heapprofile.0018.heap (9199 MB allocated cumulatively, 1028 MB currently in use)Id 0 about to release all memory: 258.144 MBId 0 done, total alloc'ed -1572.7MBId 2 about to release all memory: 257.043 MBId 2 done, total alloc'ed -1569.96MBId 3 about to release all memory: 257.854 MBId 3 done, total alloc'ed -1569.6MBId 1 about to release all memory: 257.339 MBId 1 done, total alloc'ed -1570.42MBDumping heap profile to /tmp/heapprofile.0019.heap (Exiting)real 0m11.981suser 0m44.455ssys 0m1.124sContact me for full logs/detailsUpdateTo the comments: I updated the program--- omptest/openMpMemtest_Linux.cpp 2011-05-03 23:18:44.000000000 +0200+++ q/openMpMemtest_Linux.cpp 2011-05-04 13:42:47.371726000 +0200@@ -13,8 +13,8 @@ void runParallelAllocTest() { // constants- const int NUM_ALLOCATIONS = 5000; // alloc's per thread- const int NUM_THREADS = 4; // how many threads?+ const int NUM_ALLOCATIONS = 55000; // alloc's per thread+ const int NUM_THREADS = 8; // how many threads? const int NUM_ITERS = NUM_THREADS;// how many overall repetions const bool USE_NEW = true; // use new or malloc? , seems to make no difference (as it should)It ran for over 5m3s. 
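Editorial note (not part of the original answer): besides the HEAPPROFILE environment variable, google-perftools also exposes a small C API for dumping heap profiles at points of your choosing. The header is <google/heap-profiler.h> in releases of that era and <gperftools/heap-profiler.h> in newer ones, and the functions live in the full libtcmalloc that the -ltcmalloc link flag above pulls in. A minimal sketch:

#include <google/heap-profiler.h>   // <gperftools/heap-profiler.h> in newer releases

void runParallelAllocTest();        // the test function from the question

void profiledRun()
{
    HeapProfilerStart("/tmp/heapprofile");  // prefix for the .heap dump files
    runParallelAllocTest();                 // note: the test ends with exit(1), which
                                            // would need to become a return for the
                                            // two calls below to be reached
    HeapProfilerDump("end of test");        // force one final dump
    HeapProfilerStop();
}

The resulting .heap files can then be inspected with google-perftools' pprof tool, just like the dumps produced above.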
Update

To the comments: I updated the program

--- omptest/openMpMemtest_Linux.cpp	2011-05-03 23:18:44.000000000 +0200
+++ q/openMpMemtest_Linux.cpp	2011-05-04 13:42:47.371726000 +0200
@@ -13,8 +13,8 @@
 void runParallelAllocTest()
 {
 	// constants
-	const int NUM_ALLOCATIONS = 5000;  // alloc's per thread
-	const int NUM_THREADS = 4;         // how many threads?
+	const int NUM_ALLOCATIONS = 55000; // alloc's per thread
+	const int NUM_THREADS = 8;         // how many threads?
 	const int NUM_ITERS = NUM_THREADS; // how many overall repetitions
 	const bool USE_NEW = true;         // use new or malloc? seems to make no difference (as it should)

It ran for over 5m3s. Close to the end, a screenshot of htop shows that the resident set is indeed slightly higher, going towards 2.3g:

  1  [|||||||||||||||||||||||||||||||||||||||||||||||| 96.7%]   Tasks: 125 total, 2 running
  2  [|||||||||||||||||||||||||||||||||||||||||||||||| 96.7%]   Load average: 8.09 5.24 2.37
  3  [|||||||||||||||||||||||||||||||||||||||||||||||| 97.4%]   Uptime: 01:54:22
  4  [|||||||||||||||||||||||||||||||||||||||||||||||| 96.1%]
  Mem[|||||||||||||||||||||||||||||||       3055/7936MB]
  Swp[                                          0/0MB]

  PID USER   NLWP PRI NI  VIRT   RES  SHR S CPU% MEM%    TIME+ Command
 4330 sehe      8  20  0 2635M 2286M  908 R 368. 28.8 15:35.01 ./ompmemtest

Comparing results with a tcmalloc run: 4m12s, and similar top stats with minor differences. The big difference is in the VIRT set (but that isn't particularly useful unless you have a very limited address space per process). The RES set is quite similar, if you ask me. The more important thing to note is that parallelism is increased; all cores are now maxed out. This is obviously due to the reduced need to lock for heap operations when using tcmalloc:

  If the free list is empty: (1) We fetch a bunch of objects from a central free list for this size-class (the central free list is shared by all threads). (2) Place them in the thread-local free list. (3) Return one of the newly fetched objects to the applications.

  1  [||||||||||||||||||||||||||||||||||||||||||||||| 100.0%]   Tasks: 172 total, 2 running
  2  [||||||||||||||||||||||||||||||||||||||||||||||| 100.0%]   Load average: 7.39 2.92 1.11
  3  [||||||||||||||||||||||||||||||||||||||||||||||| 100.0%]   Uptime: 11:12:25
  4  [||||||||||||||||||||||||||||||||||||||||||||||| 100.0%]
  Mem[||||||||||||||||||||||||||||||||||||||||||||  3278/7936MB]
  Swp[                                                  0/0MB]

  PID USER   NLWP PRI NI  VIRT   RES  SHR S CPU% MEM%    TIME+ Command
14391 sehe      8  20  0 2251M 2179M 1148 R 379. 27.5  8:08.92 ./ompmemtest
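To make the quoted free-list mechanism concrete, here is a minimal editorial sketch of a thread-caching allocator for a single size class (hypothetical names, not tcmalloc's actual code, and using C++11 thread_local for brevity even though the original test targets gcc 4.4): allocations and frees normally touch only a thread-local list, and the shared, mutex-protected central list is consulted only when the local cache runs dry, which is why contention on heap operations largely disappears.

#include <cstdlib>
#include <mutex>
#include <vector>

// Editorial sketch of a thread-caching free list for one size class; it is
// not tcmalloc's implementation, only an illustration of the quoted scheme.
struct SizeClassCache {
    std::vector<void*> localFree;              // per-thread cache of free chunks

    static std::vector<void*> centralFree;     // shared by all threads
    static std::mutex         centralLock;

    void* allocate(std::size_t chunkSize) {
        if (localFree.empty()) {               // (1) refill from the central list
            std::lock_guard<std::mutex> guard(centralLock);
            for (int i = 0; i < 32 && !centralFree.empty(); ++i) {
                localFree.push_back(centralFree.back());   // (2) move a batch into
                centralFree.pop_back();                     //     the local list
            }
        }
        if (localFree.empty())                 // central list empty as well:
            return std::malloc(chunkSize);     // fall back to the system heap
        void* p = localFree.back();            // (3) hand out a cached chunk,
        localFree.pop_back();                  //     no lock taken on this path
        return p;
    }

    void deallocate(void* p) {
        localFree.push_back(p);                // frees also stay thread-local
    }
};

std::vector<void*> SizeClassCache::centralFree;
std::mutex         SizeClassCache::centralLock;

thread_local SizeClassCache tcache;            // one cache per thread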