Debugging system freezes

Posted by niol on Thu 25 Jan 2007 at 10:40

Sometimes your Debian box hangs and, for some strange reason, no debugging information is printed on your screen. What options do you have?

System logs

The first place to look for debug information is /var/log. The files kern.log, daemon.log, messages and dmesg often contain valuable information about what went wrong. They can help you identify which hardware or software component is causing trouble for the kernel.

Console output

Kernel oopses are usually logged in /var/log/kern.log (/var/log/dmesg only holds the messages from boot time), but if the problem stalls hard drive I/O, you won't find much in the log files. And if you are running X, you won't be able to see what is printed to the console. But there are ways to get at that output.

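The first one is to use a kernel built with CONFIG_MAGIC_SYSRQ, which enables the magic SysRq key. Holding ALT+SysRq (the PRINTSCREEN key) and pressing a command key sends a low-level command straight to the kernel, which often still works when the rest of the system no longer responds.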
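
As a rough sketch of how this looks in practice, assuming a kernel built with CONFIG_MAGIC_SYSRQ: the script below checks whether the key is enabled and lists a few of the most useful command keys. The `sysrq_status` helper and its optional file argument are made up for this example.

```shell
#!/bin/sh
# Print the current SysRq setting: 0 = disabled, 1 = all functions enabled,
# any other value is a bitmask of allowed functions.
# The optional argument (a hypothetical convenience) overrides the file read.
sysrq_status() {
    file=${1:-/proc/sys/kernel/sysrq}
    if [ -r "$file" ]; then
        echo "kernel.sysrq = $(cat "$file")"
    else
        echo "kernel.sysrq not available"
    fi
}

sysrq_status

# As root, enable every SysRq function until the next reboot:
#   echo 1 > /proc/sys/kernel/sysrq
#
# On a hung machine, hold Alt+SysRq and press a command key, e.g.:
#   t  dump the task list to the console
#   m  dump memory information
#   s  sync all mounted filesystems
#   u  remount all filesystems read-only
#   b  reboot immediately, without syncing
#
# From a shell that still responds, root can trigger the same commands:
#   echo t > /proc/sysrq-trigger
```

The s, u, b sequence is worth remembering: it gives the filesystems a chance to be written out before the reboot.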

Serial console

Another one is to plug in a serial console, i.e. another computer connected with a null-modem cable to the COM port, or an antique dumb terminal, on the box on which you are experiencing problems. Then boot your kernel with the console=ttyS[X] parameter, where X is the COM port number minus one (COM1 is ttyS0). From the other box, you can use gkermit to open the console. This may even work using USB, but I could not find out how.
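
To make the port numbering concrete, here is a small sketch that turns a COM port number into the matching kernel parameter. The `com_to_console` helper is hypothetical, and the 115200 baud default is an assumption; use whatever speed both ends agree on.

```shell
#!/bin/sh
# Build the console= boot parameter for a given COM port.
# COM ports are numbered from 1, ttyS devices from 0.
# Usage: com_to_console COM_NUMBER [BAUD]
com_to_console() {
    echo "console=ttyS$(( $1 - 1 )),${2:-115200}n8"
}

com_to_console 1    # first serial port

# Append the result to the kernel line in your boot loader. Adding
# console=tty0 as well keeps messages on the local screen too.
#
# On the other box, any serial terminal program works, for example:
#   screen /dev/ttyS0 115200
```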

The netconsole kernel module

If you do not have the hardware, which is common because most laptops no longer come with a COM port, you can use the very handy netconsole module. It uses very low-level network device calls to send console output across your network via UDP. It is included in the standard Debian kernel. It can help you debug anything but the driver of your network device. In /etc/modprobe.d/, add a file that reads:

options netconsole netconsole=32769@192.168.1.1/eth1,32769@192.168.1.6/01:23:34:56:78:9A

Where:

  • 192.168.1.1:32769 on eth1 is the IP address, port and interface to send output from.
  • 192.168.1.6:32769 with MAC address 01:23:34:56:78:9A is the IP address, port and MAC address to send packets to.
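
If you need this line for several machines, a small helper can assemble it. This is only a sketch: the `netconsole_opts` function name and its argument order are made up for this example, and the values are the ones used above.

```shell
#!/bin/sh
# Print a modprobe.d line for netconsole.
# Usage: netconsole_opts SRC_PORT SRC_IP SRC_DEV DST_PORT DST_IP DST_MAC
netconsole_opts() {
    printf 'options netconsole netconsole=%s@%s/%s,%s@%s/%s\n' \
        "$1" "$2" "$3" "$4" "$5" "$6"
}

netconsole_opts 32769 192.168.1.1 eth1 32769 192.168.1.6 01:23:34:56:78:9A

# To find the MAC address of the receiving box from the sender
# (assuming the iproute "ip" tool is installed):
#   ping -c 1 192.168.1.6 && ip neigh show 192.168.1.6
```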

On the 192.168.1.6 box, run:

$ nc -l -p 32769 -u

Then, simply modprobe netconsole on 192.168.1.1 and output should start to appear on 192.168.1.6.
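
Here is a hedged sketch of the sender side. It assumes that you want every kernel message on the console, not just the urgent ones, which is what `dmesg -n 8` is for. The `run` wrapper and the `DRY_RUN` variable are inventions of this example: by default the script only prints the commands, and it executes them only when you explicitly ask for it as root.

```shell
#!/bin/sh
# By default only print the commands; set DRY_RUN=0 to execute them (as root).
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

start_netconsole() {
    # Raise the console log level so that all messages are sent.
    run dmesg -n 8
    # Load the module; it picks up its options from /etc/modprobe.d/.
    run modprobe netconsole
}

start_netconsole
```

If nothing arrives on the listening box, the kernel log on the sender usually says why netconsole could not be set up.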

More information on Using Netconsole to See Kernel Messages.

Nothing shows when my kernel hangs!

This is the worst-case scenario. Linux is usually very talkative, so at this point there is a very good chance that your problem is hardware-related:

  • Try to reproduce with very few peripherals connected.
  • Check your CPU temperature.
  • The odds are good that your RAM stick has defects (this is what happened to me), so try another one.
  • Do not say that you hate hardware and try to remember what it was like back in the other OS days...
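
For the temperature check, a minimal sketch, assuming a kernel recent enough to expose thermal zones under /sys/class/thermal (older kernels and some hardware use other interfaces, such as lm-sensors). The `show_temps` name and its optional directory argument are made up for this example.

```shell
#!/bin/sh
# Print each thermal zone's temperature in degrees Celsius.
# sysfs reports the value in millidegrees.
# The optional argument (a hypothetical convenience) overrides the directory.
show_temps() {
    dir=${1:-/sys/class/thermal}
    for zone in "$dir"/thermal_zone*; do
        [ -r "$zone/temp" ] || continue
        # Some zones refuse to be read; skip them quietly.
        milli=$(cat "$zone/temp" 2>/dev/null) || continue
        echo "$(basename "$zone"): $((milli / 1000)) C"
    done
}

show_temps
```

Running it in a loop while you reproduce the load that triggers the freeze shows whether the temperature climbs just before the hang.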

Good luck, because I know this is very annoying!

Posted by PJ_at_Belzabar_Software (61.246.xx.xx) on Thu 25 Jan 2007 at 12:20
"The odds are good that your RAM stick has defects (this is what happened to me), so try another one."

Did you try memtest86+?

PJ


Posted by niol (143.196.xx.xx) on Thu 25 Jan 2007 at 12:51

memtest86+ is good to mention. Sorry for omitting it in the article. For those who don't know, it is a special kernel loaded with tests for your RAM hardware.

In my case, I did not have to because I had a spare RAM stick and I knew how to reproduce the freeze. Replacing the RAM stick clearly solved the problem.


Posted by daemon (155.232.xx.xx) on Fri 26 Jan 2007 at 06:55
"Serial console

Another one is to plug a serial console, i.e another computer with a null-modem cable on the COM port, or a dumb terminal antique, to the box on which you are experiencing problems. Then, boot your kernel with the console=ttyS[X] where X is the COM port number. From the other box, you can use gkermit to open the console from the other box. This may even work using USB but I could not find how."

AFAICR, you can't use USB for accessing a serial console (yet), which is a pain, especially for newer boards that don't include serial ports as standard (or at least, boards that provide headers, but no brackets)...

This is one of the last few remaining pieces of legacy hardware that hasn't been fully addressed.

Cheers.


Posted by Anonymous (82.157.xx.xx) on Sat 27 Jan 2007 at 13:21

Crash dump tools like kdump (based on kexec) should make debugging freezes/crashes much easier.


Posted by begou (194.254.xx.xx) on Mon 29 Jan 2007 at 14:13
Thanks for providing this interesting doc. I just want to mention 2 things:

1) On sarge with the 2.6.8-11-em64t-p4-smp kernel, I cannot find the netconsole module:
# modprobe netconsole
FATAL: Module netconsole not found.

I'll try to find it and compile it.

2) I launched nc from an FC4 box to listen to my Debian server, but:
nc -l -p 32711 -u

does not work because the -l and -p options cannot be set at the same time. man nc says:
....
-l Used to specify that nc should listen for an incoming connection
rather than initiate a connection to a remote host. It is an
error to use this option in conjunction with the -p, -s, or -z
options.....

Patrick


Posted by niol (143.196.xx.xx) on Mon 29 Jan 2007 at 16:40
"On sarge with 2.6.8-11-em64t-p4-smp kernel, I cannot find the netconsole module"

For some reason, netconsole is not included in the 2.6.8-11-em64t-p4-smp official Debian kernel.

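And the Debian nc does not complain, so your error must be distro-specific. The man page you quote suggests that FC4 ships a different netcat implementation, in which the port follows -l directly, so nc -u -l 32711 may be worth a try.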


Posted by Anonymous (66.178.xx.xx) on Thu 21 Mar 2013 at 17:57
I ran into the same issue, some 6 years later. I was able to use `nc -uk -l 32769` on the receiving end (which happens to be a Mac).

And of course, you can use tee: `nc -uk -l 32769 | tee foobar.log` in order to save the output.


Posted by Anonymous (82.141.xx.xx) on Wed 31 Jan 2007 at 11:11
This netconsole hack sounds very promising. I've got 2 terminal servers that freeze every once in a while without any specific reason, and those freezes are not reproducible. I'll see what I can find out with this one, as the machines are running X and I can't fall back to the console once it's not responding at all (log files are not showing anything suspicious, of course).

Thanks for the tip.


Posted by Anonymous (74.68.xx.xx) on Thu 8 Feb 2007 at 23:40
On the hardware side of the house, power supplies are good to check too.

I had a cheap file server that died under load. Turns out the 5V power was at 4.8V which was enough to turn it over, but a little stress on the system and it locked up. I replaced the PS and it's been running ever since.


Posted by Freddy_Freeloader (72.24.xx.xx) on Fri 7 Sep 2007 at 00:26
This article was exactly what I needed today. We have a system that has frozen up twice without leaving a trace in the logs, and this allowed us to catch the problem.

Thank you.

BTW, this is an excellent site. I have learned a lot here, and found solutions to the issues I have faced more than once.


Posted by Anonymous (213.209.xx.xx) on Tue 30 Mar 2010 at 09:49
/etc/security/limits.conf is a good start to prevent users from starting e.g. fork bombs

e.g.
-----snip----------
@users soft nproc 100
@users hard nproc 150
-----snap----------

I would also be interested in a way to log the processes that were not started because of this limit. Does anybody know anything about that?


Posted by Anonymous (203.212.xx.xx) on Sun 23 Jan 2011 at 07:49
How do you lower the CPU temperature?


Posted by Anonymous (81.175.xx.xx) on Sun 23 Oct 2011 at 11:25
"More information on Using Netconsole to See Kernel Messages." link seems dead.
