Problem Description
I've got a Java application that is consuming 100% of the CPU most of the time (as indicated by cacti and top monitoring). We fired up YourKit (which confirmed the CPU resource issue), and it identifies java.net.SocketInputStream.read(byte[], int, int) as the biggest hot spot at 15% of the time. I believe it isn't accurately measuring CPU time for methods that perform blocking IO, as SocketInputStream.read would.
There are six other identified hot spots, but combined they account for less than 20% of the accounted-for CPU time; all are in the 1%-5% range.
So I know I have a problem, and I can see it (YourKit does too), but I am no closer to identifying the actual cause.
I am pretty new to using a profiler and am most likely missing something. Any ideas?
Edit: Sean makes a good point about using the tools built into the system. If I use top with Shift+H to view threads, it displays anywhere from 7 to 15 threads, and the CPU utilization jumps around among them. I don't believe any one thread is causing the problem; rather, it is a piece of code that each thread executes at some point.
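You can also sample per-thread CPU consumption from inside the JVM itself, which avoids the profiler's blocking-IO attribution problem entirely. A minimal sketch using the standard ThreadMXBean (class name is my own; per-thread CPU timing may be unsupported on some JVMs, so the code checks first):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadCpuSampler {
    public static void main(String[] args) {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        if (!bean.isThreadCpuTimeSupported()) {
            System.out.println("Per-thread CPU time not supported on this JVM");
            return;
        }
        // Print the CPU time consumed so far by every live thread.
        // Run this periodically (e.g. from a monitoring thread) and
        // diff the numbers to see which threads are actually burning CPU.
        for (long id : bean.getAllThreadIds()) {
            ThreadInfo info = bean.getThreadInfo(id);
            long cpuNanos = bean.getThreadCpuTime(id); // -1 if thread died
            if (info != null && cpuNanos >= 0) {
                System.out.printf("%-30s %10.1f ms%n",
                        info.getThreadName(), cpuNanos / 1_000_000.0);
            }
        }
    }
}
```

A thread that is blocked in SocketInputStream.read accumulates almost no CPU time here, so this view separates "hot" threads from merely "busy-looking" ones.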
Accepted Answer
I would recommend running this on a Solaris box if you can. If you don't have one, consider setting up a virtual machine running OpenSolaris.
Solaris offers a tool called prstat.
prstat works much like top, which most people are familiar with. The important difference is that prstat can break a process up for you and show each thread within it.
For your case the usage would be:

prstat -L 0 1
Paired with a thread dump (doing this in a script is preferred), you can match the LWPIDs together to find exactly which thread is the CPU hog.
Here is a working example (I created a small app that spins in a big loop as a proof of concept).
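The proof-of-concept app was roughly the following. This is a reconstruction under my own assumptions, not the original source; the class name ConsumerThread and the Random.nextInt call are taken from the thread dump shown later in this answer:

```java
import java.util.Random;

// Hypothetical reconstruction of the CPU-burning PoC: a thread that
// spins calling Random.nextInt(), producing the ConsumerThread.run
// frame seen in the thread dump.
public class ConsumerThread extends Thread {
    private volatile boolean running = true;
    private final Random random = new Random();
    private long sum;

    @Override
    public void run() {
        while (running) {
            sum += random.nextInt(100); // pure CPU work, no blocking
        }
    }

    public void shutdown() {
        running = false;
    }

    public long getSum() {
        return sum; // safely visible after join()
    }

    public static void main(String[] args) throws InterruptedException {
        ConsumerThread t = new ConsumerThread();
        t.start();
        Thread.sleep(10_000); // let it burn CPU while you sample with prstat
        t.shutdown();
        t.join();
        System.out.println("iterations summed to " + t.getSum());
    }
}
```

While main sleeps, this thread pegs one core, which is what makes it show up clearly in both top and prstat -L.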
Standard top will show you something like the following:
PID USERNAME NLWP PRI NICE SIZE RES STATE TIME CPU COMMAND
924 username 10 59 0 31M 11M run 0:53 36.02% java
Then prstat was run with the following command:
prstat -L 0 1 | grep java > /export/home/username/Desktop/output.txt
And the output from prstat:
PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/LWPID
924 username 31M 10M run 30 0 0:00:09 35% java/10
924 username 31M 10M sleep 59 0 0:00:00 0.8% java/3
924 username 31M 10M sleep 59 0 0:00:00 0.6% java/2
924 username 31M 10M sleep 59 0 0:00:00 0.3% java/1
This may not look much different from top, but notice the right side of the data: the PROCESS/LWPID column tells you the exact thread within the java process that is consuming the CPU. The thread running with lightweight process id (LWPID) 10 is consuming 35% of the CPU. As I mentioned before, if you pair this with a thread dump, you can find the exact thread. In my case, this is the relevant portion of the thread dump:
"Thread-0" prio=3 tid=0x08173800 nid=0xa runnable [0xc60fc000..0xc60fcae0]
java.lang.Thread.State: RUNNABLE
at java.util.Random.next(Random.java:139)
at java.util.Random.nextInt(Random.java:189)
at ConsumerThread.run(ConsumerThread.java:13)
On the header line of the thread, the nid can be matched to the LWPID: nid=0xa, which is 10 in decimal when converted from hex.
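The matching step can be automated: parse the nid field out of each thread-dump header line and convert it from hex to the decimal LWPID that prstat reports. A small sketch (the class name, method name, and regex are my own, not part of any standard tool):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NidMatcher {
    // Matches the nid=0x... field in a HotSpot thread-dump header line.
    private static final Pattern NID = Pattern.compile("nid=0x([0-9a-fA-F]+)");

    // Extract the nid from a dump line and return it as a decimal LWPID,
    // or -1 if the line has no nid field.
    public static long nidToLwpid(String dumpLine) {
        Matcher m = NID.matcher(dumpLine);
        return m.find() ? Long.parseLong(m.group(1), 16) : -1;
    }

    public static void main(String[] args) {
        String line = "\"Thread-0\" prio=3 tid=0x08173800 nid=0xa runnable"
                + " [0xc60fc000..0xc60fcae0]";
        // prstat reported java/10 as the hog; nid=0xa is 10 in decimal.
        System.out.println("LWPID = " + nidToLwpid(line)); // prints LWPID = 10
    }
}
```

Feeding every line of a jstack-style dump through this lets a script print only the stack of the thread whose LWPID matches the prstat hog.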
If you can put the prstat and thread dump commands in a script and run it 4-5 times during high CPU usage, you will begin to see patterns and be able to determine the cause of your high CPU that way.
In my time, I have seen this result from causes ranging from long-running GC pauses to a misconfigured LDAP connection. Have fun :)