Posted by simonw on Fri 27 Jan 2006 at 13:40
Many higher-end servers have an Intelligent Platform Management Interface (IPMI) that lets you observe a whole host of hardware parameters. Usually these systems also support plug-in remote management cards (for example DELL RAC cards) that allow remote resets and other remote diagnostics.
This software used to be a pain to install, as it required kernel patches or extra modules. We needed some thermal monitoring added in a hurry here due to air conditioning problems, and it seems it is now much simpler.
On a DELL 2650 running Debian Sarge with the stock Sarge 2.6 kernel:
# apt-get install ipmitool
# /usr/share/ipmitool/ipmi.init.basic
# ipmitool -I open sensor list
If the last two commands work and produce useful output, all you need to do is make it work after the next reboot, as the device file created by the init script may need a different major device number, and find some way of handling the output. The tools also allow network management. For the reboot I went with the old /etc/rc.boot directory, just sticking the ipmi.init.basic script in there (see /etc/init.d/rcS).
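To illustrate the point about the major device number, here is a minimal sketch of how a boot script might look it up instead of hard-coding it. It assumes the IPMI device interface driver registers itself as "ipmidev" in /proc/devices and that the tool opens /dev/ipmi0. The function name ipmi_major is made up for this example, and this is not a copy of what ipmi.init.basic does.

```shell
#!/bin/sh
# Sketch only: find the character-device major number the kernel assigned
# to the IPMI device interface. Assumes the driver appears as "ipmidev".
# ipmi_major is a hypothetical helper name, not part of ipmitool.
ipmi_major() {
    # $1: devices table to read (defaults to /proc/devices)
    awk '$2 == "ipmidev" { print $1; exit }' "${1:-/proc/devices}"
}

# Possible use at boot, as root (assumes the device node is /dev/ipmi0):
#   major=$(ipmi_major)
#   [ -n "$major" ] && { rm -f /dev/ipmi0; mknod -m 0600 /dev/ipmi0 c "$major" 0; }
```

Looking the number up on every boot means the device node still matches if the kernel hands out a different major next time.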
For monitoring we've gone with a simple Perl script that checks everything is okay and pages us if it isn't. We tested it by setting the upper non-critical (unc) temperature threshold below ambient temperature.
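The Perl script itself isn't shown here, but the same idea can be sketched in a few lines of shell. This assumes "ipmitool sensor list" prints pipe-separated columns with the sensor name first and the status fourth, and that threshold sensors report a status of ok, nc, cr or nr. The function name check_sensors is made up for this example.

```shell
#!/bin/sh
# Sketch only: read "ipmitool sensor list" style output on stdin and print
# every sensor whose status has crossed a threshold. Assumed layout:
#   name | reading | units | status | thresholds...
# Assumed status codes: nc (non-critical), cr (critical), nr (non-recoverable).
# check_sensors is a hypothetical name, not part of ipmitool.
check_sensors() {
    awk -F'|' '
        NF >= 4 {
            name = $1; status = $4
            gsub(/^[ \t]+|[ \t]+$/, "", name)     # trim column padding
            gsub(/^[ \t]+|[ \t]+$/, "", status)
            if (status == "nc" || status == "cr" || status == "nr") {
                printf "%s: %s\n", name, status
                bad = 1
            }
        }
        END { exit (bad + 0) }   # non-zero exit if anything was flagged
    '
}
```

Run from cron, something like "ipmitool -I open sensor list | check_sensors" would print the offending sensors and exit non-zero, which is enough to hang a mail or pager command off. Discrete sensors, which report a hex state rather than one of these codes, are deliberately ignored in this sketch.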
ipmitool also lets you adjust the thresholds. We figured early warning of temperature issues is kind of important to us right now, so we tweaked down the non-critical thresholds:
ipmitool -I open sensor thresh "ESM Frt I/O Temp" unc 40
IPMI also allows watchdog checking for operating system crashes, but I'll likely ignore that for now, as crashes really aren't a big problem.
Anyone familiar with this technology going to tell me what I should have done? And how it fits with the other free software for such tools?
This article can be found online at the Debian Administration website at the following bookmarkable URL:
This article is copyright 2006 simonw - please ask for permission to republish or translate.