Problem description
We have a server app that does a lot of memory allocations (both short-lived and long-lived). We are seeing an awful lot of Gen2 collections shortly after startup, but these collections calm down after a period of time (even though the memory allocation pattern is constant). These collections are hitting performance early on.
I'm guessing that this could be caused by GC budgets (for Gen2?). Is there some way I can set this budget (directly or indirectly) to make my server perform better at the beginning?
One counter-intuitive set of results I've seen: we made a big reduction to the amount of memory (and Large Object Heap) allocations, which improved performance over the long term, but early performance got worse and the "settling down" period got longer.
The GC apparently needs a certain period of time to realise our app is a memory hog and adapt accordingly. I already know this fact; how do I convince the GC?
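Edit
- OS: 64-bit Windows Server 2008 R2
- We're using .NET 4.0 ServerGC with Batch latency. We tried 4.5 and the 3 different latency modes, and while average performance improved slightly, worst-case performance actually got worse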
EDIT2
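- A GC spike can double the time taken (we're talking seconds), going from acceptable to unacceptable
- Almost all spikes correlate with Gen2 collections
- My test run ends with a final heap size of 32 GB. The initial thrashing lasts for the first 1/5 of the run time, and performance after that is actually better (less frequent spikes), even though the heap is growing. The last spike near the end of the test (at the largest heap size) is the same height (i.e. as bad) as 2 of the spikes in the initial "training" period (with a much smaller heap)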
Recommended answer
Allocation of an extremely large heap in .NET can be insanely fast, and the number of blocking collections will not prevent it from being that fast. The problems you observe are caused by the fact that you don't just allocate, but also have code that causes dependency reorganizations and actual garbage collection, all at the same time as the allocation is going on.
There are a few techniques to consider:
- try using LatencyMode (http://msdn.microsoft.com/en-us/library/system.runtime.gcsettings.latencymode(v=vs.110).aspx), set to LowLatency while you are actively loading the data; see the comments to this answer as well
- use multiple threads
- do not populate cross-references to newly allocated objects while actively loading; first go through the active allocation phase, using only integer indexes to cross-reference items rather than managed references; then force a full GC a couple of times to get everything into Gen2, and only then populate your advanced data structures; you may need to rethink your deserialization logic to make this happen
- try forcing your biggest root collections (arrays of objects, strings) into the second generation as early as possible; do this by preallocating them and forcing a full GC two times before you start populating data (loading millions of small objects); if you are using some flavor of generic Dictionary, make sure to preallocate its capacity early on, to avoid reorganizations
- any big array of references is a big source of GC overhead until both the array and the referenced objects are in Gen2; the bigger the array, the bigger the overhead; prefer arrays of indexes to arrays of references, especially for temporary processing needs
- avoid having many utility or temporary objects deallocated or promoted on any thread during the active loading phase; look carefully through your code for string concatenation, boxing, and 'foreach' iterators that can't be auto-optimized into 'for' loops
- if you have an array of references and a hierarchy of function calls with some long-running tight loops, avoid introducing local variables that cache the reference value from some position in the array; instead, cache the offset value and keep using a construct like "myArrayOfObjects[offset]" across all levels of your function calls; this helped me a lot when processing pre-populated, large Gen2 data structures. My personal theory is that it helps the GC manage temporary dependencies on your local thread's data structures, thus improving concurrency
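The loading-phase advice above can be sketched roughly as follows. This is only an illustration under stated assumptions: `Node`, `ParentIndex`, and `Loader` are hypothetical names that do not come from the answer, and the sketch uses `GCLatencyMode.SustainedLowLatency` (available from .NET 4.5) instead of the `LowLatency` mode the answer names, because `LowLatency` is documented as unavailable for the server garbage collector. Whether this helps a given workload has to be measured.

```csharp
using System;
using System.Runtime;

// Hypothetical record type, not from the original answer.
sealed class Node
{
    public int Id;
    public int ParentIndex = -1; // integer cross-reference used while loading
    public Node Parent;          // managed reference, filled in only after loading
}

static class Loader
{
    public static Node[] Load(int count)
    {
        // Preallocate the root array and force two full GCs before bulk loading starts.
        var nodes = new Node[count];
        GC.Collect();
        GC.Collect();

        GCLatencyMode previous = GCSettings.LatencyMode;
        try
        {
            // Assumption: SustainedLowLatency stands in for the LowLatency mode named above.
            GCSettings.LatencyMode = GCLatencyMode.SustainedLowLatency;

            // Active allocation phase: objects refer to each other by index only.
            for (int i = 0; i < count; i++)
            {
                nodes[i] = new Node { Id = i, ParentIndex = (i == 0) ? -1 : (i - 1) / 2 };
            }
        }
        finally
        {
            // Always restore the previous mode once loading is done.
            GCSettings.LatencyMode = previous;
        }

        // Force full collections so the surviving objects get promoted before linking.
        GC.Collect();
        GC.Collect();

        // Only now turn the integer indexes into managed references.
        for (int i = 0; i < count; i++)
        {
            int p = nodes[i].ParentIndex;
            if (p >= 0) nodes[i].Parent = nodes[p];
        }
        return nodes;
    }
}
```

A call such as `Loader.Load(1000000)` would run the whole sequence; in a real loader the index assignment would come from the deserialized data rather than from the made-up `(i - 1) / 2` rule used here.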
Here are the reasons for this behavior, as far as I learned from populating up to ~100 GB of RAM during app startup, with multiple threads:
- when the GC moves data from one generation to another, it actually copies it and thus modifies all references to it; therefore, the fewer cross-references you have during the active load phase, the better
- the GC maintains a lot of internal data structures that manage references; if you make massive modifications to the references themselves, or if you have a lot of references that have to be modified during GC, it causes significant CPU and memory bandwidth overhead during both blocking and concurrent GC; sometimes I observed the GC constantly consuming 30-80% of CPU without any collections going on, simply because the app was doing some processing. This looks weird until you realize that any time you put a reference into some array or some temporary variable in a tight loop, the GC has to modify and sometimes reorganize its dependency-tracking data structures
- server GC uses thread-specific Gen0 segments and is capable of pushing an entire segment to the next generation (without actually copying data; not sure about this one, though); keep this in mind when designing a multi-threaded data load process
- ConcurrentDictionary, while being a great API, does not scale well in extreme scenarios with multiple cores once the number of objects goes above a few million (consider using an unmanaged hashtable optimized for concurrent insertion, such as the one that comes with Intel's TBB)
- if possible or applicable, consider using a native pooled allocator (Intel TBB, again)
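To check whether a change to the loading code actually reduces the number of collections, one option is to compare `GC.CollectionCount` per generation before and after a piece of work. The helper below is a minimal sketch; `GcProbe` and `CountCollections` are hypothetical names, and the counts say nothing about how long each collection took.

```csharp
using System;

static class GcProbe
{
    // Runs the given work and returns, per generation, how many
    // collections happened while it ran (index = generation number).
    public static int[] CountCollections(Action work)
    {
        int generations = GC.MaxGeneration + 1;
        var before = new int[generations];
        for (int gen = 0; gen < generations; gen++)
        {
            before[gen] = GC.CollectionCount(gen);
        }

        work();

        var delta = new int[generations];
        for (int gen = 0; gen < generations; gen++)
        {
            delta[gen] = GC.CollectionCount(gen) - before[gen];
        }
        return delta;
    }
}
```

Wrapping the loading phase in `GcProbe.CountCollections(...)` and looking at the last element of the result gives the number of Gen2 collections it triggered, which is the figure the question ties its spikes to.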
BTW, the latest update to .NET 4.5 has defragmentation support for the large object heap. One more great reason to upgrade to it.
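As a rough sketch of how that large object heap defragmentation is requested, assuming the update in question is .NET Framework 4.5.1 or later, where `GCSettings.LargeObjectHeapCompactionMode` is available (`LohCompaction` is a hypothetical wrapper name):

```csharp
using System;
using System.Runtime;

static class LohCompaction
{
    // Requests a one-time compaction of the large object heap and
    // triggers it immediately with a blocking full collection.
    public static void CompactNow()
    {
        GCSettings.LargeObjectHeapCompactionMode =
            GCLargeObjectHeapCompactionMode.CompactOnce;
        GC.Collect();
    }
}
```

The compaction happens during the next blocking full GC, so it is best called at a quiet point, for example right after the initial load has finished, rather than while the server is handling requests.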