Debugging system freezes
Posted by niol on Thu 25 Jan 2007 at 10:40
Sometimes your Debian box hangs and, for some strange reason, no debugging information is printed on your screen. What options do you have?
System logs
The first place to look for debug information is /var/log. The files kern.log, daemon.log, messages and dmesg often contain precious information about what went wrong. This will help you identify which hardware or software component is causing trouble for the kernel.
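As a rough sketch of that first pass over the logs, the function below greps the files named above for a few typical trouble markers. The function name and the pattern list are my own assumptions, not an exhaustive catalogue of what the kernel can print, so adjust them to your problem.

```shell
# Search log files for common kernel trouble markers.
# Usage: scan_kernel_logs [file...]
# With no arguments, the Debian log files mentioned in the article are used.
scan_kernel_logs() {
    if [ "$#" -eq 0 ]; then
        set -- /var/log/kern.log /var/log/messages /var/log/daemon.log /var/log/dmesg
    fi
    for f in "$@"; do
        # Skip missing or unreadable files (reading them usually needs root or group adm)
        [ -r "$f" ] || continue
        # -H prints the file name, -n the line number, -i ignores case.
        # The pattern list is an assumption; extend it as needed.
        grep -HniE 'oops|panic|BUG:|call trace|segfault|I/O error' "$f" || true
    done
    return 0
}
```

Run scan_kernel_logs as root after rebooting from a freeze. Lines logged just before the hang are the interesting ones, so compare their timestamps with the time of the freeze.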
Console output
Kernel oopses are usually recorded in the kernel logs (/var/log/kern.log, with boot-time messages in /var/log/dmesg), but if the problem stalls hard drive I/O, you won't find much in the log files. And if you are running X, you won't be able to see what is printed to the console. But there are ways to get that output.
The first one is the magic SysRq key. A kernel built with CONFIG_MAGIC_SYSRQ accepts ALT+SYSRQ (ALT+PRINTSCREEN) key combinations that talk directly to the kernel, even when everything else appears frozen.
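Whether the key combination does anything depends on a runtime setting as well as on the kernel configuration. Here is a minimal sketch for checking it. The function name is hypothetical, and its optional argument exists only so that it can be tried against a file other than the real procfs entry.

```shell
# Report whether magic SysRq is enabled at runtime.
# Usage: sysrq_status [control-file]   (defaults to /proc/sys/kernel/sysrq)
sysrq_status() {
    ctl=${1:-/proc/sys/kernel/sysrq}
    [ -r "$ctl" ] || { echo "unavailable"; return 1; }
    val=$(cat "$ctl")
    case "$val" in
        0) echo "disabled" ;;
        1) echo "all functions enabled" ;;
        *) echo "partially enabled (bitmask $val)" ;;
    esac
}

# As root, enable every SysRq function until the next reboot:
#   echo 1 > /proc/sys/kernel/sysrq
# Keyboard-free equivalents of ALT+SYSRQ+<key>, handy over SSH:
#   echo t > /proc/sysrq-trigger   # dump the state of every task to the kernel log
#   echo s > /proc/sysrq-trigger   # sync mounted filesystems
```

The task dump goes to the console and the kernel log, so it is most useful together with one of the remote console methods described in this article.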
Serial console
Another option is to attach a serial console, i.e. another computer connected with a null-modem cable to the COM port, or an antique dumb terminal, to the box on which you are experiencing problems. Then boot your kernel with the console=ttyS[X] parameter, where X is the serial port index (ttyS0 is COM1, ttyS1 is COM2). On the other box, you can use a terminal program such as kermit or minicom to open the console. This may even work using USB, but I could not find out how.
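The mapping between COM port numbers and ttyS devices is easy to get wrong by one, so here is a small sketch that builds the boot parameters. The function name is hypothetical, and the 115200 baud default is an assumption: use whatever speed both ends agree on.

```shell
# Build kernel boot parameters for a serial console on a given COM port.
# Usage: serial_console_args <com-port-number> [baud]
serial_console_args() {
    com=$1
    baud=${2:-115200}   # assumed default speed
    # COM1 is ttyS0, COM2 is ttyS1, and so on.
    # n8 means no parity and 8 data bits.
    # console=tty0 keeps messages on the local screen as well.
    echo "console=tty0 console=ttyS$((com - 1)),${baud}n8"
}

# On the other box, open the port at the same speed, for example:
#   screen /dev/ttyS0 115200
```

For COM1, serial_console_args 1 prints console=tty0 console=ttyS0,115200n8, which you append to the kernel line in your boot loader configuration.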
The netconsole kernel module
If you do not have the hardware, which is common because most laptops no longer come with a COM port, you can use the very handy netconsole module. It uses very low-level network device calls to send console output across your network via UDP. It is included in the standard Debian kernel. It can help you debug anything except your network controller driver. In /etc/modprobe.d/, add a file that reads:
options netconsole netconsole=32769@192.168.1.1/eth1,32769@192.168.1.6/01:23:34:56:78:9A
Where:
- 192.168.1.1:32769 on eth1 is the IP address, port and interface to send output from.
- 192.168.1.6:32769 with MAC address 01:23:34:56:78:9A is the IP address, port and MAC address to send packets to.
On the 192.168.1.6 box, run:
$ nc -l -p 32769 -u
Then, simply modprobe netconsole on 192.168.1.1 and output should start to appear on 192.168.1.6.
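Because the option string packs six values into one line, a small helper can make the field order explicit. This is only a sketch: the function name and the file name /etc/modprobe.d/netconsole in the comments are hypothetical, and the values are the ones from the example above.

```shell
# Build the modprobe options line for netconsole.
# Usage: netconsole_options <src-ip> <src-port> <dev> <dst-ip> <dst-port> <dst-mac>
netconsole_options() {
    src_ip=$1 src_port=$2 dev=$3 dst_ip=$4 dst_port=$5 dst_mac=$6
    # Field order: source port@source IP/device, then target port@target IP/target MAC
    echo "options netconsole netconsole=${src_port}@${src_ip}/${dev},${dst_port}@${dst_ip}/${dst_mac}"
}

# To look up the MAC address of the receiving box from the sending box:
#   ping -c 1 192.168.1.6 && arp -n 192.168.1.6
# As root, write the line to a file (the file name is only an example):
#   netconsole_options 192.168.1.1 32769 eth1 192.168.1.6 32769 01:23:34:56:78:9A \
#       > /etc/modprobe.d/netconsole
```

Called with the values shown in the comment, the function reproduces the options line given above.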
More information is available in "Using Netconsole to See Kernel Messages".
Nothing shows when my kernel hangs!
This is the worst-case scenario. Linux is usually very talkative. At this point, there is a very good chance that your problem is hardware related:
- Try to reproduce with very few peripherals connected.
- Check your CPU temperature.
- The odds are good that your RAM stick has defects (this is what happened to me), so try another one.
- Do not say that you hate hardware and try to remember what it was like back in the other OS days...
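For the temperature check, here is a minimal sketch. It assumes a kernel that exposes thermal zones under /sys/class/thermal, with values in thousandths of a degree Celsius. Older kernels and some hardware report temperatures elsewhere, in which case the function simply prints nothing. The function name is hypothetical.

```shell
# Print the temperature of every thermal zone the kernel exposes.
# Usage: cpu_temps [base-directory]   (defaults to /sys/class/thermal)
cpu_temps() {
    base=${1:-/sys/class/thermal}
    for zone in "$base"/thermal_zone*; do
        # Skip zones without a readable temp file (also covers the case of no zones at all)
        [ -r "$zone/temp" ] || continue
        milli=$(cat "$zone/temp")
        # The kernel reports millidegrees Celsius; integer division drops the fraction
        echo "$(basename "$zone"): $((milli / 1000)) C"
    done
}
```

Run it in a loop while you reproduce the load that triggers the freeze, and watch whether the values climb shortly before the box hangs.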
Good luck, because I know this is very annoying!
Did you try memtest86+?
PJ
memtest86+ is good to mention. Sorry for omitting it in the article. For those who don't know, it is a special kernel loaded with tests for your RAM hardware.
In my case, I did not have to because I had a spare RAM stick and I knew how to reproduce the freeze. Replacing the RAM stick clearly solved the problem.
Serial console
Another one is to plug a serial console, i.e another computer with a null-modem cable on the COM port, or a dumb terminal antique, to the box on which you are experiencing problems. Then, boot your kernel with the console=ttyS[X] where X is the COM port number. From the other box, you can use gkermit to open the console from the other box. This may even work using USB but I could not find how.
AFAICR, you can't use USB for accessing a serial console (yet), which is a pain, especially for newer boards that don't include serial ports as standard (or at least, boards that provide headers, but no brackets)...
This is one of the last few remaining pieces of legacy hardware that hasn't been fully addressed.
Cheers.
Crash dump tools like kdump (based on kexec) should make debugging freezes/crashes much easier.
1) On sarge with the 2.6.8-11-em64t-p4-smp kernel, I could not find the netconsole module:
# modprobe netconsole
FATAL: Module netconsole not found.
I'll try to find it and compile it.
2) I launched nc on an FC4 box to listen for my Debian server, but:
nc -l -p 32711 -u
does not work, because the -l and -p options cannot be set at the same time. man nc says:
....
-l Used to specify that nc should listen for an incoming connection
rather than initiate a connection to a remote host. It is an
error to use this option in conjunction with the -p, -s, or -z
options.....
Patrick
On sarge with the 2.6.8-11-em64t-p4-smp kernel, I could not find the netconsole module
For some reason, netconsole is not included in the 2.6.8-11-em64t-p4-smp official Debian kernel.
And the Debian nc does not complain, so your error must be distro specific.
Thanks for the tip.
I had a cheap file server that died under load. It turned out the 5V rail was at 4.8V, which was enough to start the machine, but with a little stress on the system it locked up. I replaced the power supply and it's been running ever since.
Thank you.
BTW, this is an excellent site. I have learned a lot here, and found solutions to the issues I have faced more than once.