Kernel freeze: how to debug it?

Question

I have an embedded board with a kernel module of several thousand lines that freezes at random times, in random and complex use cases. What options do I have for debugging it?

I have already tried the magic SysRq key, but it does not work. My guess is that I am stuck in a loop or a deadlock in code where hardware interrupts are disabled?

Thanks, Eva.

Recommended answer

Typically, embedded boards have a watchdog. You should enable this timer and use a user-space watchdog process to kick the watchdog hardware. Use nice to run the watchdog process at low priority, so that higher-priority tasks must relinquish the CPU for the kicks to continue. This gives clues as to the issue. If the device does not reset with the watchdog active, then it may be that only the network or serial port has stopped communicating; i.e., the kernel has not locked up, and the issue is that there is no user-visible activity. The watchdog is also useful if/when this type of issue occurs in the field.
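
As a minimal sketch of such a user-space kicker: this assumes the board exposes the standard Linux watchdog character device at /dev/watchdog and that Python is available on the target (both are assumptions; a small C daemon or a stock watchdog daemon would do the same job). The function name, its parameters, and the 10 second interval are illustrative and must be adapted to the hardware timeout.

```python
import os
import time


def kick_watchdog(device_path="/dev/watchdog", interval_s=10.0, kicks=None,
                  niceness=19, sleep=time.sleep):
    """Kick the watchdog device every interval_s seconds.

    kicks=None loops forever; an integer limits the number of kicks, which
    makes it possible to try the loop against an ordinary file.
    Returns the number of kicks written.
    """
    # Lower our own priority. If a higher-priority task never gives up the
    # CPU, this process stops running, the kicks stop, and the board resets.
    if niceness:
        try:
            os.nice(niceness)
        except OSError:
            pass  # keep the current priority if it cannot be changed

    count = 0
    fd = os.open(device_path, os.O_WRONLY)  # opening the device arms the timer
    try:
        while kicks is None or count < kicks:
            os.write(fd, b"\0")  # any write counts as a kick
            count += 1
            sleep(interval_s)
        # "Magic close": on drivers that support it, writing 'V' before
        # closing disarms the watchdog on a deliberate, clean exit.
        os.write(fd, b"V")
    finally:
        os.close(fd)
    return count
```

If the loop is interrupted by an exception, the device is closed without the magic character, so on drivers that honor magic close the timer keeps running and the board resets, which is the behavior wanted while hunting a freeze.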

For a kernel lockup case, the kernel's lockup watchdog features may be useful. They will work if you have an infinite loop or deadlock, as you speculate. However, if this is custom hardware, it is also possible that the SDRAM or a peripheral device latches up and causes abnormal bus activity. This will stop the CPU from fetching proper code; obviously, it is tough for Linux to recover from this.
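
On kernels built with the lockup detectors, their settings appear as files under /proc/sys/kernel. The sketch below only reads them, to confirm what is enabled before reproducing the freeze. The helper name and its `root` parameter are illustrative, and entries that a given kernel does not provide are reported as None.

```python
from pathlib import Path

# sysctl entries of the lockup detectors (see the kernel sysctl documentation)
LOCKUP_SYSCTLS = (
    "watchdog",          # 1 = lockup detectors enabled
    "watchdog_thresh",   # hard-lockup threshold in seconds; soft-lockup is twice this
    "softlockup_panic",  # 1 = panic when a soft lockup is detected
    "hardlockup_panic",  # 1 = panic when a hard lockup is detected
)


def read_lockup_settings(root="/proc/sys/kernel"):
    """Return {name: int or None} for each lockup-detector sysctl."""
    settings = {}
    for name in LOCKUP_SYSCTLS:
        try:
            settings[name] = int((Path(root) / name).read_text().strip())
        except (OSError, ValueError):
            settings[name] = None  # not present or not readable on this kernel
    return settings
```

With softlockup_panic at 0, a detected soft lockup is only logged as a warning with a stack trace; setting it to 1 turns the detection into a panic.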

You can combine the watchdog with some fallow memory that is used as a trace buffer. The memmap= and mem= kernel command-line parameters can limit the memory used by the kernel, which leaves a region untouched. A driver/device using this memory can be written that saves trace points that survive a reboot. The fallow memory's ring buffer is dumped when a watchdog reset is detected on kernel boot.
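
The layout of such a trace buffer can be sketched as follows. This is a user-space model only: a `bytearray` stands in for the reserved physical region, which a real driver would have to map itself. The magic value, the header layout, and the record format are assumptions chosen for illustration.

```python
import struct

MAGIC = 0x54524143             # arbitrary marker; assumption for this sketch
HEADER = struct.Struct("<IQ")  # magic, total number of records ever written
RECORD = struct.Struct("<II")  # trace point id, argument


class TraceRing:
    """Ring buffer of trace records stored inside a caller-supplied region."""

    def __init__(self, region):
        self.region = region
        self.slots = (len(region) - HEADER.size) // RECORD.size
        magic, _ = HEADER.unpack_from(region, 0)
        if magic != MAGIC:
            # No valid header (for example after a cold boot): start empty.
            HEADER.pack_into(region, 0, MAGIC, 0)

    def trace(self, point_id, arg=0):
        """Store one record, overwriting the oldest one when full."""
        _, written = HEADER.unpack_from(self.region, 0)
        offset = HEADER.size + (written % self.slots) * RECORD.size
        RECORD.pack_into(self.region, offset, point_id, arg)
        # Update the counter last so a reset mid-write loses only this record.
        HEADER.pack_into(self.region, 0, MAGIC, written + 1)

    def dump(self):
        """Return the surviving records, oldest first."""
        _, written = HEADER.unpack_from(self.region, 0)
        first = max(0, written - self.slots)
        records = []
        for n in range(first, written):
            offset = HEADER.size + (n % self.slots) * RECORD.size
            records.append(RECORD.unpack_from(self.region, offset))
        return records
```

After a watchdog reset, the boot-time code would construct a `TraceRing` over the same region and call `dump()` before recording anything new; because the magic value is still present, the old records are kept rather than reinitialized.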

It is also useful to register thread notifiers that do a printk on context switches, either when the issue is repeatable or to discover how to make it repeatable. Once you determine a sequence of events that leads to the lockup, you can use an oscilloscope or logic analyzer to do some final diagnosis. Or it may be evident at this point which peripheral is the issue.

You may also set panic=-1 and reboot=... on the kernel command line. The kdump facilities are useful if you only have a code problem.

Related: kernel trap (at web archive). This link may no longer be available, but it isn't important to this answer.

07-29 20:19