How to measure the time spent on context switching on the Java platform

Question

Let's assume each thread is doing some floating-point (FP) calculation. I am interested in:


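  • how much time the CPU spends switching between threads instead of running them

  • how much synchronization traffic is created on the shared memory bus: when threads share data, they must use a synchronization mechanism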

My question: how can I design a test program to get this data?

Recommended answer

You can't easily separate the waste due to thread switching from the waste due to memory-cache contention. You can measure the thread contention: on Linux, you can cat /proc/PID/XXX and get tons of detailed per-thread statistics. However, since the pre-emptive scheduler is not going to shoot itself in the foot, you're not going to get more than, say, 30 context switches per second no matter how many threads you use, and that time is going to be relatively small compared with the amount of work you're doing. The real cost of context switching is cache pollution: there is a high probability that you'll have mostly cache misses once you're switched back in. Thus OS time and context-switch counts are of minimal value.
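As a rough illustration of reading such statistics from Java, the sketch below parses the voluntary_ctxt_switches and nonvoluntary_ctxt_switches fields that Linux exposes in /proc/self/status (per-thread files live under /proc/PID/task/TID/status). The class and method names are made up for this example, and it assumes a Linux /proc filesystem; elsewhere it simply reports -1.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical helper, not part of any library.
class CtxSwitchStats {
    // Returns the numeric value of a "key:<tab>value" line, or -1 if the key is absent.
    static long parseField(List<String> lines, String key) {
        for (String line : lines) {
            if (line.startsWith(key + ":")) {
                return Long.parseLong(line.substring(key.length() + 1).trim());
            }
        }
        return -1;
    }

    // Reads {voluntary, nonvoluntary} counters from /proc/self/status.
    // Assumption: Linux. Returns {-1, -1} when the file is unavailable.
    static long[] readSelf() {
        Path status = Paths.get("/proc/self/status");
        if (!Files.isReadable(status)) {
            return new long[] {-1, -1};
        }
        try {
            List<String> lines = Files.readAllLines(status);
            return new long[] {
                parseField(lines, "voluntary_ctxt_switches"),
                parseField(lines, "nonvoluntary_ctxt_switches")
            };
        } catch (IOException e) {
            return new long[] {-1, -1};
        }
    }

    public static void main(String[] args) {
        long[] before = readSelf();
        double x = 0.0;
        for (int i = 1; i < 50_000_000; i++) {
            x += Math.sqrt(i); // stand-in for the FP workload being measured
        }
        long[] after = readSelf();
        System.out.println("work result: " + x);
        System.out.println("voluntary switches during work: " + (after[0] - before[0]));
        System.out.println("nonvoluntary switches during work: " + (after[1] - before[1]));
    }
}
```

Sampling the counters before and after the workload gives a count of switches, not their cost, which is the limitation the answer points out.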

What's really valuable is the ratio of inter-thread cache-line dirties. Depending on the CPU, a cache-line dirty followed by a peer-CPU read is slower than a cache miss, because you have to force the peer CPU to write its value to main memory before you can even start reading. Some CPUs let you pull from peer cache lines without hitting main memory.
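One way to make this effect visible from Java is a small experiment in which each thread updates only its own counter, once with the counters packed next to each other and once spaced apart. This is a sketch under assumptions: the names are invented, a 64-byte cache line is assumed (so a stride of 16 longs, 128 bytes, should keep the slots on separate lines), and the JVM does not guarantee array alignment, so the timings are only indicative.

```java
import java.util.concurrent.atomic.AtomicLongArray;

// Hypothetical experiment class, not part of any library.
class CacheLineDirtyDemo {
    // Each worker increments only its own slot; returns the elapsed wall time in nanoseconds.
    // stride = 1 packs the slots side by side; a larger stride spreads them out.
    static long timeNanos(AtomicLongArray counters, int threads, int stride, int iterations) {
        Thread[] workers = new Thread[threads];
        long start = System.nanoTime();
        for (int t = 0; t < threads; t++) {
            final int slot = t * stride;
            workers[t] = new Thread(() -> {
                for (int i = 0; i < iterations; i++) {
                    counters.incrementAndGet(slot); // atomic write dirties the slot's cache line
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        int threads = 2;
        int iterations = 20_000_000;
        long packed = timeNanos(new AtomicLongArray(threads), threads, 1, iterations);
        long spaced = timeNanos(new AtomicLongArray(threads * 16), threads, 16, iterations);
        System.out.println("adjacent slots: " + packed / 1_000_000 + " ms");
        System.out.println("spaced slots:   " + spaced / 1_000_000 + " ms");
    }
}
```

The threads never touch each other's data logically, so any gap between the two timings is a hint at the cost of cache lines bouncing between cores. Run it several times, since JIT warm-up and scheduling add noise.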

So the key is to absolutely minimize any shared, modified memory structures. Make everything as read-only as possible. This includes shared FIFO buffers (including Executor pools). Namely, if you use a synchronized queue, then every sync operation touches a shared dirty memory region. Moreover, if the rate is high enough, it will likely trigger an OS trap to stall while waiting for a peer thread's mutex.

The ideal is to segment RAM: distribute a single large unit of work to each of a fixed number of workers, then use a count-down latch or some other memory barrier (such that each thread only touches it once). Ideally, any temporary buffers are pre-allocated instead of going into and out of a shared memory pool (which then causes cache contention). Java 'synchronized' blocks leverage (behind the scenes) a shared hash-table memory space and thus trigger the undesirable dirty reads. I haven't determined whether Java 5 Lock objects avoid this, but you're still relying on OS stalls, which won't help your throughput. Obviously, most OutputStream operations trigger such synchronized calls (and of course are typically filling a common stream buffer).
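The segmentation idea can be sketched roughly as follows: the input array is only read, each worker gets one contiguous range, accumulates into a local variable, publishes its result once, and touches the CountDownLatch once. The class name and the sqrt workload are assumptions chosen for illustration, not something from the original answer.

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical example class, not part of any library.
class SegmentedSum {
    // Sums sqrt(data[i]) using a fixed number of workers, one contiguous segment each.
    static double sumOfSqrt(double[] data, int workers) {
        final double[] partial = new double[workers]; // one slot per worker, written once
        final CountDownLatch done = new CountDownLatch(workers);
        final int chunk = (data.length + workers - 1) / workers;
        for (int w = 0; w < workers; w++) {
            final int id = w;
            final int from = Math.min(data.length, w * chunk);
            final int to = Math.min(data.length, from + chunk);
            new Thread(() -> {
                double local = 0.0; // thread-private accumulator, no shared writes in the loop
                for (int i = from; i < to; i++) {
                    local += Math.sqrt(data[i]);
                }
                partial[id] = local; // single shared write per worker
                done.countDown();    // single synchronization point per worker
            }).start();
        }
        try {
            done.await(); // countDown() happens-before await() returns, so partial[] is visible
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        double total = 0.0;
        for (double p : partial) {
            total += p;
        }
        return total;
    }

    public static void main(String[] args) {
        double[] data = new double[10_000_000];
        for (int i = 0; i < data.length; i++) {
            data[i] = i;
        }
        long start = System.nanoTime();
        double result = sumOfSqrt(data, Runtime.getRuntime().availableProcessors());
        System.out.println(result + " in " + (System.nanoTime() - start) / 1_000_000 + " ms");
    }
}
```

Timing the same call with workers set to 1 gives a single-threaded baseline to compare against on a given machine.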

Generally, my experience is that single-threading is faster than multithreading for a common byte array, object array, etc., at least with the simplistic sorting/filtering algorithms that I've experimented with. This is true both in Java and C in my experience. I haven't tried FPU-intensive ops (like divides or sqrt), where cache lines may be less of a factor.

Basically, if you're on a single CPU you don't have cache-line problems (unless the OS is always flushing the cache even between threads that share memory), but multithreading buys you less than nothing. With hyperthreading, it's the same deal. In single-CPU configurations with a shared L2/L3 cache (e.g. AMD), you might find some benefit. On multi-CPU Intel buses, forget it: shared write memory is worse than single-threading.


08-24 06:58