Monitoring with Munin

Posted by martijnvanb on Fri 23 May 2008 at 13:47

In this article I will describe how to install Munin on two computers (you can add more if you want to). This will allow us to remotely monitor system performance and activity.

Munin uses a master/node architecture. The master is provided by the munin package; it collects data from a local or remote daemon. That daemon is called munin-node, and it gathers data on the machine it runs on.

munin-node allows one or more masters to fetch its data and store it centrally on the machine where the munin master is running.

Now let's get started. First I will describe my setup:

One computer is called aikido (192.168.1.1). On this computer we will install munin (master) and munin-node (daemon).

Aikido will be our central place for collecting data. You could call this the "server", but I think that is incorrect.

My other computer is called jiu-jitsu (192.168.1.2). On this computer we will only install munin-node (daemon).

Aikido (master and daemon)

Let's start by installing munin and munin-node on aikido:

aikido:~# aptitude install munin munin-node

Apache configuration

By default the Munin graphs are public. I don't like that; if you don't mind, you can skip this part. I will make the Munin graphs more private by protecting them with a username and password.

Add this user with:

aikido:~# htpasswd -c /etc/munin/munin.passwd munin
New password: *******
Re-type new password:  *******
Adding password for user munin

Tell apache2 to ask for a username and password when viewing the Munin graphs:

aikido:~# vi /etc/apache2/sites-available/munin.conf
Alias /munin /var/www/munin/

<directory /var/www/munin/>
        AllowOverride None
        Options ExecCGI FollowSymlinks
        AddHandler cgi-script .cgi
        DirectoryIndex index.cgi
        AuthUserFile /etc/munin/munin.passwd
        AuthType basic
        AuthName "Munin stats"
        require valid-user
</directory>

Enable munin.conf in apache2:

aikido:~# a2ensite munin.conf

Reload apache to activate the new munin config:

aikido:~# /etc/init.d/apache2 reload

Now our Munin graphs are protected with a username and password.

Configure munin(master)

Each node has to be configured in munin (master), so let's add our nodes to the master's configuration:

aikido:~# vi /etc/munin/munin.conf
[aikido]
    address 127.0.0.1
    use_node_name yes

[jiu-jitsu]
    address 192.168.1.2
    use_node_name yes
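
If you add more nodes later, the host sections all follow the same pattern. As a rough sketch (the helper name munin_host_stanza and the third host "karate" are my own illustration, not part of Munin), a small shell function can print a section for you to review before appending it to munin.conf:

```shell
#!/bin/sh
# Print a munin.conf host section for one node.
# Usage: munin_host_stanza NAME ADDRESS
# (munin_host_stanza is a hypothetical helper, not a Munin command.)
munin_host_stanza() {
    printf '[%s]\n    address %s\n    use_node_name yes\n\n' "$1" "$2"
}

# Example, as root on the master, for a hypothetical third node:
#   munin_host_stanza karate 192.168.1.3 >> /etc/munin/munin.conf
```

The function only prints text, so nothing changes until you redirect its output into the configuration file yourself.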

Configure munin-node(daemon)

By default a node listens on all interfaces, but only allows connections from localhost. Let's change that so the daemon listens only on its own loopback interface (127.0.0.1).

aikido:~# vi /etc/munin/munin-node.conf
change host * to host 127.0.0.1
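
If you prefer not to edit the file by hand, the same change can probably be scripted. This is a minimal sketch, assuming the file contains a single line starting with "host" followed by whitespace; set_node_host is a name I made up for illustration:

```shell
#!/bin/sh
# Rewrite the "host" line of a munin-node.conf so the node binds to one address.
# Usage: set_node_host CONFIG_FILE ADDRESS
# (set_node_host is a hypothetical helper, not shipped with Munin.)
set_node_host() {
    conf=$1
    addr=$2
    # Keep a backup next to the file, then rewrite the "host ..." line.
    # The pattern requires whitespace after "host", so "host_name" is untouched.
    cp "$conf" "$conf.bak" &&
    sed "s/^host[[:space:]].*/host $addr/" "$conf.bak" > "$conf"
}

# Example, as root:
#   set_node_host /etc/munin/munin-node.conf 127.0.0.1
```

The original file is kept as munin-node.conf.bak, so the change is easy to undo.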

We made a change to munin-node.conf so we have to restart it:

aikido:~# /etc/init.d/munin-node restart

Now we are finished with the master part.

jiu-jitsu (daemon)

Let's install munin-node:

jiu-jitsu:~# aptitude install munin-node

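By default this node also listens on all interfaces and only allows connections from localhost. Let's change that so the daemon listens on its own interface (192.168.1.2) and only allows a certain master (192.168.1.1) to collect data.

jiu-jitsu:~# vi /etc/munin/munin-node.conf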
change host * to host 192.168.1.2
change allow ^127\.0\.0\.1$ to allow ^192\.168\.1\.1$
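
Note that the allow line is a regular expression, which is why the dots are escaped and the address is anchored with ^ and $. A small sketch (allow_regex is a hypothetical helper name of mine) that builds this expression from a plain IPv4 address:

```shell
#!/bin/sh
# Turn a dotted IPv4 address into the anchored regular expression used on an
# "allow" line in munin-node.conf.
# Usage: allow_regex 192.168.1.1
# (allow_regex is a hypothetical helper, not part of Munin.)
allow_regex() {
    # Escape every dot so it matches a literal "." instead of any character.
    printf '^%s$\n' "$(printf '%s' "$1" | sed 's/\./\\./g')"
}

# Example:
#   echo "allow $(allow_regex 192.168.1.1)"
```

Without the escaping, an unescaped dot would match any character, so the pattern would also accept addresses you did not intend.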

We made changes to munin-node.conf so we need to restart munin-node:

jiu-jitsu:~# /etc/init.d/munin-node restart

Plugins

By default the Munin plugins are installed in /usr/share/munin/plugins. If you want to enable a plugin, just add a symbolic link to that plugin in /etc/munin/plugins.

For example, we might be interested in some firewall graphs of jiu-jitsu:

jiu-jitsu:~# cd /etc/munin/plugins/
ln -s /usr/share/munin/plugins/fw* .

And perhaps some graphs about Munin itself:

jiu-jitsu:~# cd /etc/munin/plugins/
ln -s /usr/share/munin/plugins/munin* .

Restart munin-node:

jiu-jitsu:~# /etc/init.d/munin-node restart
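
The linking steps above could be wrapped in a small function. This is only a sketch; enable_plugins is a name I invented, and the two directory defaults are simply the paths used in this article:

```shell
#!/bin/sh
# Enable every available plugin whose file name matches a shell pattern by
# symlinking it into the active plugin directory.
# Usage: enable_plugins PATTERN [AVAILABLE_DIR] [ENABLED_DIR]
# (enable_plugins is a hypothetical helper, not a Munin command.)
enable_plugins() {
    pattern=$1
    available=${2:-/usr/share/munin/plugins}
    enabled=${3:-/etc/munin/plugins}
    for plugin in "$available"/$pattern; do
        [ -e "$plugin" ] || continue      # the pattern matched nothing
        ln -sf "$plugin" "$enabled/${plugin##*/}"
        echo "enabled ${plugin##*/}"
    done
}

# Example, as root on jiu-jitsu (quote the pattern so the function expands it):
#   enable_plugins 'fw*'
#   enable_plugins 'munin*'
```

You still need to restart munin-node afterwards so that it picks up the new links.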

We're finished!

By default munin gathers information every 5 minutes.

Drink some coffee or tea and point your browser to http://aikido/munin/index.html. Log in with your username and password, and if you did everything right you should see some nice graphs of aikido and jiu-jitsu.

Some plugins require additional configuration; read the information supplied with the plugins.

That's all.


Posted by Anonymous (81.57.xx.xx) on Fri 23 May 2008 at 15:59
One should note that Munin exchanges traffic in the clear on the wire, and has only a very rough ACL (just IP based). Also, munin-node (the thing you run on monitored hosts) runs completely as root, i.e. with no privilege separation, even for actions that do not need privileges. That means a lot of code running as root, munin-node being Perl (interpreter plus many modules).

Indeed, SSL support is being developed, but it's on the experimental branch, and since Munin hasn't released any stable version (not even a minor upgrade on the stable branch) for about two years, you won't see SSL in a released Munin for some time...

But, apart from those oddities, Munin deserves more publicity for being a good complement to Nagios. It is not designed as an "alert oriented" monitoring tool (which Nagios does well anyway), but as a good, simple and clean way to get graphs of host resources (network, CPU, RAM, etc.) without effort. Munin makes graphing resources *way* easier (and deployment way faster) than the traditional mrtg + snmp + bunch of homemade Perl scripts approach.

Posted by jeld (163.192.xx.xx) on Fri 23 May 2008 at 18:21
Munin is bad on many counts.
1. It depends on cron for its runs, so you cannot do any real-time monitoring.
2. It doesn't have a good security model.
3. It is rather poorly documented.
4. It doesn't have much in the way of dependencies (don't alert on A if B is down).
5. It is written in Perl.

You should really check out Zabbix. I have tried most OSS monitoring solutions, and I maintain that on a small scale there is nothing better than monit and on a big scale there is nothing better than Zabbix.

You are off the edge of the map, mate. Here there be monsters!

Posted by Anonymous (200.117.xx.xx) on Fri 23 May 2008 at 22:09
Munin is not the best tool for monitoring, right, but Munin is fantastic for stats and analysis.

I use a combination of Nagios & Munin for all my servers (> 40).
You can install and configure it in seconds.
Writing a plugin is also very, very easy.
And the information it generates is very useful.

Security model? You can define exactly which plugin runs as which user.

And documentation: have you seen http://munin.projects.linpro.no/wiki/Documentation ?

I think that it's not the best tool for monitoring a server, but I also think it is the best way to see in detail what happens with the resources of a server at a given moment.

Best,

Mark

Posted by jeld (163.192.xx.xx) on Sat 24 May 2008 at 00:04
Actually, one of the main reasons I didn't start using Munin for my stuff is the lack of any reasonable documentation on how to write a plugin. Maybe the situation has changed. Can you write a plugin in something other than Perl?

I don't use nagios because of its 60 second minimal check interval limitation. I check some of my stuff every 5 seconds.

You are off the edge of the map, mate. Here there be monsters!

Posted by Anonymous (200.117.xx.xx) on Sat 24 May 2008 at 08:56
Yes! You can write a plugin in any language!
Only the output lines matter.

Look at this:

http://munin.projects.linpro.no/wiki/HowToWritePlugins
http://munin.projects.linpro.no/wiki/ShellPlugins

Shell script, Perl, etc.

Hmm, about the Nagios minimum check interval, I don't know; I'm checking everything every 5 minutes.
I will check whether this has changed in the latest version.

Best,

Mark.
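
To illustrate the point about output lines, here is a minimal sketch of a shell plugin that graphs the 1-minute load average. It assumes a Linux system with /proc/loadavg; the function name load_plugin and the LOADAVG_FILE override are my own illustration and exist only to make the sketch easy to try out:

```shell
#!/bin/sh
# Minimal Munin plugin sketch: graphs the 1-minute load average.
# LOADAVG_FILE is a hypothetical override used only for testing the sketch;
# on a real Linux node the data comes from /proc/loadavg.
load_plugin() {
    if [ "$1" = "config" ]; then
        # "config" describes the graph and its fields.
        echo 'graph_title Load average (sketch)'
        echo 'graph_vlabel load'
        echo 'graph_category system'
        echo 'load.label load'
        return 0
    fi
    # Without arguments, print one "<field>.value <number>" line per field.
    read -r one _rest < "${LOADAVG_FILE:-/proc/loadavg}"
    echo "load.value $one"
}

# When saved as a plugin file, the script would end with:
#   load_plugin "$@"
```

Any language that can print these lines to standard output could be used in the same way.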

Posted by Anonymous (80.69.xx.xx) on Mon 26 May 2008 at 06:35
It's not that difficult to write such a plugin.
We query some temperature sensors in our server room with a simple self-written Perl plugin for Munin.
It works like a charm :)

Posted by liotier (81.57.xx.xx) on Sat 24 May 2008 at 08:12
Munin may not be perfect, but it hums along without a single problem, and setting it up was extremely easy, even for multiple hosts and a large variety of monitored parameters. I even went as far as creating a new script using an existing one as a template, and I found it quite doable in spite of my very low competence. Munin discreetly solves a common problem in a simple way. I hope that fact gets more recognition, so that activity behind Munin grows to solve the slight shortcomings that it still has.

Posted by Anonymous (88.86.xx.xx) on Sat 24 May 2008 at 15:33
Indeed a lot of features are missing from Munin, and I think I can say the same of the whole community around it. For example, people don't like to release their own Munin scripts, like the ones for monitoring Cisco devices, because it took them too long to write those and they won't admit it. You will figure that out when you start writing your own, especially ones that query SNMP OIDs...

SSL would certainly be a nice thing to add, as would parallel queries from the Munin server side. I had a network with about 60 hosts and VMs monitored with Munin. There are various plugins loaded on different machines and yes, it's slow. It takes about 5-10 minutes to query all the hosts on that network (on a dual 3 GHz Xeon box with 8 GB RAM).

And lastly, Munin lacks a very important part: forwarder nodes. When you have a segmented network with various subnets, you have to use workarounds to reach your Munin server.

Posted by localhost (41.226.xx.xx) on Sun 25 May 2008 at 17:56
Munin is good for its:
* Easy installation and deployment.
* 5 minutes 'learn and code' plugins.
* Simplicity!
And not good where:
* Real-time monitoring is needed (Munin does not permit a polling interval of less than 1 minute).
* Graphs are "just a bunch of images"; you can't zoom into graphs for details (just like Cacti).
* If the monitoring node goes down (or the network is down), information about monitored nodes is lost.
*(1) There's only one possible schema: one monitoring node, multiple monitored nodes.
*(2) There's no way to monitor nodes inside DMZs or private networks that the monitoring node can't reach; a dirty architecture has to be deployed to reach these nodes, such as installing a monitoring node in every private network, losing the single interface for monitoring all nodes.

(1+2): Can anyone tell me if Nagios or Zabbix can do this?

Posted by sebg (217.128.xx.xx) on Mon 26 May 2008 at 10:26

Hey guys, if you don't like Munin, that's fine, but don't mix things up, and check twice before making false claims: Munin is made for statistics gathering, not for service availability monitoring... The project is active, with plenty of documentation in the FAQ/wiki and a great plugin directory (links below).

It may be used to forward threshold-crossing alerts to Nagios, nothing more.

In the Debian world, deploying Munin is dead simple (apt-get install munin-node on the nodes, apt-get install munin on the "server"), so you have plenty of time to write plugins (in any language, not only Perl).

For real-time collecting, have a look at collectd. It generates fine-grained RRD databases, but no graphs (you can use drraw for this task). The plugin interface for collectd is a little bit more complex.

For security considerations, you can use iptables rules to back up Munin's ACLs if you don't trust them, and use an administration network (most machines have at least two network interfaces nowadays) or tunnel administration traffic inside OpenVPN tunnels (this may be used to overcome firewall/NAT/DMZ limitations).
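
As a sketch of the iptables idea, the function below only prints two rules so they can be reviewed first. It assumes munin-node listens on its default TCP port 4949 and that the rules belong in the INPUT chain; munin_fw_rules is a made-up helper name:

```shell
#!/bin/sh
# Print (do not apply) iptables rules that let only the Munin master reach
# munin-node's default TCP port 4949.
# Usage: munin_fw_rules MASTER_IP
# (munin_fw_rules is a hypothetical helper; adapt chain and port to your setup.)
munin_fw_rules() {
    # Accept connections to the node port from the master only...
    echo "iptables -A INPUT -p tcp --dport 4949 -s $1 -j ACCEPT"
    # ...and drop everything else aimed at that port.
    echo "iptables -A INPUT -p tcp --dport 4949 -j DROP"
}

# Example, on a monitored node:
#   munin_fw_rules 192.168.1.1          # review the rules
#   munin_fw_rules 192.168.1.1 | sh     # apply them, as root
```

The order matters: the ACCEPT rule has to come before the DROP rule, because iptables stops at the first matching rule.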

To end: open source is a great world. If you don't like a program, you could:
* submit patches.
* check Trac tickets for existing wishes.
* open Trac tickets for new wishes :).
* try other programs.
* use several programs for the same task.

Documentation
FAQ
Plugin directory
TLS support
Tickets
Roadmap
Wishlist
Open tasks

Posted by endecotp (86.6.xx.xx) on Sat 31 May 2008 at 22:14
Since everyone's expressing their opinion I'll jump in with mine. I'm using it on a single machine so I have no real experience of the node/server stuff.

Good:
- Graphs look good.
- Debian packages are quick to deploy.
- Plugins available for lots of things (see the munin site for a searchable archive of user-contributed ones).
- Relatively easy to write plugins.
- Can be configured to generate the graphs continuously or on demand.

Bad:
- Gives up exactly when you need it most: if the load average has reached 100, I want to be able to look at the graphs and see why. But I'll see nothing because the processes will have been killed before they finished. (I filed a Debian bug on this and it was "wontfix"ed.)
- Documentation is not great. The quickest way to write a plugin seems to be to copy an existing one, rather than looking at the docs; but then you run into the next problem.
- Code quality in plugins is not great, especially in user-contributed ones that you download.


For me, it's currently "good enough".

Phil.

Posted by Anonymous (88.86.xx.xx) on Mon 2 Jun 2008 at 01:26
For monitoring a single machine, MRTG/Munin is good. But to add to your comment, I also had to monitor threshold values like a program's memory usage, or simply check that a process didn't die.
I wanted to do all of this through Munin, without installing other packages on the servers, but unfortunately I ended up writing shell scripts for it.

Advanced settings like defining [1:30] for process monitoring are buggy and time-consuming to configure and test. There were times when the process died and it didn't alert me or, on the contrary, it generated false alerts.

Then, even if you turn alerts on for criticals only, you will get a lot of alerts if your machines are running the SMART plugins (which you probably need). For me the number of alerts went up to thousands per week and I didn't get the important ones, so I decided to stop using the alert function and do it my own way. I did it with Nagios plus scripts.
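
For the "did the process die" case mentioned above, a plugin can simply graph the number of matching processes. This is only a sketch, assuming pgrep is installed; proc_plugin, the PROC_NAME setting and the "sshd" default are placeholders of mine:

```shell
#!/bin/sh
# Sketch of a plugin that graphs how many processes with a given exact name
# are running; a value of 0 means the process is gone.
# PROC_NAME is a hypothetical setting; "sshd" is only a placeholder default.
proc_plugin() {
    name=${PROC_NAME:-sshd}
    if [ "$1" = "config" ]; then
        echo "graph_title Running $name processes"
        echo 'graph_vlabel processes'
        echo 'graph_category processes'
        echo "procs.label $name"
        return 0
    fi
    # pgrep prints one PID per line; count the lines (0 if nothing matches).
    count=$(pgrep -x "$name" | wc -l)
    echo "procs.value $((count))"
}

# When saved as a plugin file, the script would end with:
#   proc_plugin "$@"
```

This only produces the graph; whether you alert on it from Munin or from a separate script is a different decision.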

Posted by endecotp (86.6.xx.xx) on Mon 2 Jun 2008 at 10:59
In case anyone's interested, here's the plugin that I've recently written to monitor my auth.log. I did this so that I could look for any ssh key attacks after the recent catastrophe - though I'm not actually sure that the test for "failed public key" is right! Can anyone confirm?

#!/bin/sh
#
# A Munin Plugin to show auth stuff
# This version by Phil Endecott
# Original by Dominik Schulz <lkml@ds.gauner.org>:
# http://developer.gauner.org/munin/
# Based on a work of "jintxo"
#
# Parameters understood:
#
# config (required)
# autoconf (optional - used by munin-config)
#
#
# Magic markers (optional - used by munin-config and installation
# scripts):
#
#%# family=auto
#%# capabilities=autoconf


if [ "$1" = "autoconf" ]; then
    echo yes
    exit 0
fi

if [ "$1" = "config" ]; then
    echo 'graph_title Authentication Events'
    echo 'graph_args --base 1000 -l 0'
    echo 'graph_vlabel Events/minute'
    echo 'graph_category system'
    echo 'graph_period minute'

    # Field names are the labels in lower case with spaces replaced by "_".
    for i in "Accepted publickey" "Accepted passwd" \
             "Failed publickey" "Failed passwd" "Invalid user"
    do
        j=`echo "$i" | tr ' A-Z' '_a-z'`
        echo "$j.label $i"
        echo "$j.type DERIVE"
        echo "$j.min 0"
    done
    exit 0
fi

# Count today's events; the field names printed here must match "config".
d=`date '+%b %_d'` # auth.log space-pads the day of month, e.g. "Jun  1"
awk "BEGIN { PUBKEY=0; PASSWD=0; F_PUBKEY=0; F_PASSWD=0; INVALID=0; }
    /^$d/ && /Accepted publickey/ { PUBKEY++; }
    /^$d/ && /Accepted password/ { PASSWD++; }
    /^$d/ && /Failed publickey/ { F_PUBKEY++; }
    /^$d/ && /Failed password/ { F_PASSWD++; }
    /^$d/ && /Invalid user/ { INVALID++; }
    END { print \"accepted_publickey.value\", PUBKEY;
          print \"accepted_passwd.value\", PASSWD;
          print \"failed_publickey.value\", F_PUBKEY;
          print \"failed_passwd.value\", F_PASSWD;
          print \"invalid_user.value\", INVALID; } " < /var/log/auth.log

Posted by jamiewilliams (78.32.xx.xx) on Sat 7 Jun 2008 at 14:45
Nice article. One of the issues we found with Munin was "fixing" the graph scale when there are odd spikes.

Posted by Anonymous (93.195.xx.xx) on Tue 24 Jun 2008 at 07:03
That fix doesn't get to the bottom of the problem though. It just applies a rigid graphing cap, which depending on what you're tracking you likely have to figure out differently for each machine.

Looking at the rrdtool docs, you'll see that what happens when those spikes occur is just a reset of a COUNTER to 0 being interpreted by rrdtool as a counter wrap-around. That's why it throws in such huge values: it assumes the counter hit its limit (of 2^32 or 2^64), which of course it didn't. It was just reset, during a reboot or whatever.

The rrdtool docs give a recommendation on how to circumvent this issue, i.e. define those fields as DERIVE with minimum 0. For those, the "wrap around" algorithm doesn't kick in, because DERIVE counters can go up and down, while COUNTERs can only go up.

That isn't easily corrected though, because once the RRDs have been created by the plugins, it would need some data migration to get your history into corrected ones. I wish plugin writers would investigate the tools they're basing their work on a little more thoroughly, so things get set up properly from the start. To date, people don't seem to really understand what's going on there.

Posted by jamiewilliams (69.120.xx.xx) on Tue 24 Jun 2008 at 11:58
> you'll see that what happens when those spikes occur
> is just a reset of a COUNTER to 0 being interpreted
> by rrdtool as a counter wrap around.

I don't think that is the problem here. This is load average, and the value is actually going really high.

Counters being reset to zero, and use of DERIVE, would be suitable for something like tracking network usage via ifconfig. Sooner or later the network card will reset to zero.

In this case, we're actually seeing legitimate values like:

0.1, 0.2, 0.1, 5, 11, 55, 50, 49, 20, 9, 2, 1, 0.2, 0.1, ...

which isn't any reset or wrap-around.

You do need to apply this per machine, because of the way Munin bases its graph parameters on what the individual nodes tell it. In our experience, however, once you find the appropriate "fixed" level for a specific graph on a specific machine, it tends to be fairly stable for a while.

I'm not claiming it's perfect, but it does allow you to "fix" graphs, even if temporarily.

Posted by Anonymous (93.195.xx.xx) on Thu 26 Jun 2008 at 00:16
> I don't think that is the problem here.

Well, I don't know what's the problem there at your place, but it sure is the problem here.

As said, "fake" spikes are thrown in by rrdtool on COUNTER type resets, because it assumes an overflow in case the current value is lower than the previous one. That's just part of the COUNTER definition.

So it throws in the difference between the current value and the previous one, modulo the maximum, which depends on the counter size (32/64 bit). And this one can be huge compared to the values you might usually see.
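
To put numbers on that, here is a small sketch of the arithmetic for a 32-bit counter. The sample values are made up, the function names are mine, and the code only mimics the behaviour described in this thread rather than calling rrdtool:

```shell
#!/bin/sh
# Compare how a counter reset is interpreted under COUNTER and under
# DERIVE with a minimum of 0. All numbers are made-up illustration values.

# Usage: counter_rate PREV CUR STEP_SECONDS   (32-bit counter assumed)
counter_rate() {
    delta=$(( $2 - $1 ))
    # COUNTER treats a decrease as a wrap-around and adds 2^32.
    [ "$delta" -lt 0 ] && delta=$(( delta + 4294967296 ))
    echo $(( delta / $3 ))
}

# Usage: derive_rate PREV CUR STEP_SECONDS
derive_rate() {
    delta=$(( $2 - $1 ))
    # With min 0, a negative rate is out of range and is stored as unknown (U).
    if [ "$delta" -lt 0 ]; then echo U; else echo $(( delta / $3 )); fi
}

# Example: the counter was at 5000, the box rebooted, and 300 seconds later
# it reads 100.
#   counter_rate 5000 100 300    # prints 14316541 (the fake spike)
#   derive_rate  5000 100 300    # prints U (a gap instead of a spike)
```

For a normal increase, say from 100 to 5000 in 300 seconds, both functions give the same rate of 16 per second (rounded down).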

Common example is ADSL links with average upstream bandwidth. On that regular reset, you're likely to see a fake spike of 10Mbit or more, which simply flattens out the rest of the graph to the point of being unusable and is very annoying.

The lesson to be learned there is don't use COUNTER on measurements which sport comparatively low counter values and get regularly reset. Use DERIVE with minimum 0, which actually seems to be a better general default than COUNTER.

Of course, it'd be still better if rrdtool made that overflow assumption a tunable option. I'm not sure though it's gonna happen anytime soon.

Posted by jamiewilliams (80.68.xx.xx) on Thu 26 Jun 2008 at 12:39
Ah! Yes, using COUNTER when you wanted DERIVE is a problem.

Not being able to (easily?) change it, is a problem.

That it causes a massive "spike" in the graph, is also a problem.

But the problem we have, though, is when the actual data spikes. E.g., load average:

0.1, 0.2, 5, 15, 50, 49, 3, 2, 0.1, 0.1

Manually "fixing" the scale is certainly not perfect. It certainly doesn't address the underlying cause (in either your COUNTER/DERIVE situation, or in ours with actual data value spikes).

For us, it does improve the usefulness of the graphs tho.

Posted by Anonymous (93.195.xx.xx) on Sat 28 Jun 2008 at 05:52
Right, I see. Well, you could try logarithmic scale in these cases. Haven't tried them myself, but I guess they're to base ten, which basically means the first Y grid gives you values from 0 to 1, the next from 1 to 10, then 10 to 100 and so on.

As values increase, they get increasingly "compressed" on the Y axis, so you have more accuracy with lower values but still can see the peaks. As long as you and others who might use the graphs know how to correctly read those scales it might be worth a shot.

Posted by jamiewilliams (80.68.xx.xx) on Sat 28 Jun 2008 at 12:30
> You could try logarithmic scale in these cases. 
> Haven't tried them myself.

We have ;) And we updated our notes to include details about logarithmic graphs a few days ago.

In short, they have some advantages, but it's now a lot harder to see the actual loadavg values, and if your data then lacks the spikes and contains only small values, they become mostly useless. We've opted to stick with our original solution of just fixing the y-axis scale range from 0 to 2.
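
For reference, both variants come down to the options passed to rrdtool graph through a plugin's graph_args line. A sketch, assuming the plugin prints graph_args in its "config" output; the two function names are made up, while --lower-limit, --upper-limit, --rigid and --logarithmic are rrdtool graph options:

```shell
#!/bin/sh
# Print a graph_args line for a plugin's "config" output.
# (Both function names are hypothetical; the options are rrdtool graph options.)

# Usage: graph_args_fixed LOWER UPPER   -> rigid y-axis from LOWER to UPPER
graph_args_fixed() {
    echo "graph_args --base 1000 --lower-limit $1 --upper-limit $2 --rigid"
}

# Usage: graph_args_log                 -> logarithmic y-axis
graph_args_log() {
    echo "graph_args --base 1000 --logarithmic"
}

# Example, the fixed 0 to 2 range described above:
#   graph_args_fixed 0 2
```

With --rigid, values above the upper limit are cut off at the top of the graph instead of rescaling the axis.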

We are still curious about why the Munin wiki says logarithmic "should almost never be used".

Posted by Anonymous (89.79.xx.xx) on Wed 11 Jun 2008 at 21:57
Hello,
I'm working on my BA thesis, in which I am to design a distributed system monitoring tool, very similar to Munin in terms of functionality and design primitives. Unfortunately, I was unaware of Munin's existence at the time I chose my subject and wanted to create something new.

Since I am stuck with my subject anyway and Munin is apparently not often upgraded, I would like to create a better Munin-like tool, keeping in mind the issues which were brought forward in the comments on this article. I would be very grateful for any feedback on this matter.

My goals are as follows:
* write it in Perl (since I know it best),
* make it as modular as possible,
* flow model: every node connects to the server to transmit its data,
* maintain Munin plugin compatibility,
* encrypted transmission,
* preserve node data when the master node is down,
* keep it simple.

I am aware that one man's year of work cannot compare to the collective design that Munin's team has delivered, but I have plenty of enthusiasm to try :)

Posted by jamiewilliams (69.120.xx.xx) on Tue 24 Jun 2008 at 12:07
Instead of re-implementing Munin, would it be possible to change your project to:

"Add feature X, Y & Z to the existing Munin project"

This way you can make Munin better, and not have to re-implement the wheel.

It might require a little fancy footwork talking it through with your supervisor, but fwiw:

* I'd have thought they would be more impressed that you noticed and adapted, rather than blindly following a previous plan that is now known to be a little flawed.

* Working with an existing project, team, community adds a great dimension to your project. You might be getting judged solely on your technical merit, but working with others and integrating with existing work is a very useful skill and something you could perhaps factor into your evaluation/write-up.

* You can concentrate on the really interesting, new, as-yet-unavailable bits. Like encrypted communication. And not spend half your time re-doing the dull, boring, scaffolding "framework" ground work. That is already there in Munin.

* It is much closer to many real-world projects, where you have to adapt existing systems. Greenfield projects where you get to start entirely from scratch are often the most fun, but sadly not nearly as common.

Talk to your supervisor - you never know, and adding to Munin would be a much bigger win for you, Munin and everyone else too.

Posted by Anonymous (87.106.xx.xx) on Mon 14 Jul 2008 at 17:00
I don't think he can use an existing project; usually for a thesis you have to produce something new from scratch, and the subjects given are kinda stupid. You shouldn't overwork yourself on your thesis; just make something small and you'll pass, no worries (most people don't put any work into theirs and still pass). You can worry about working on bigger projects later in your career.

Posted by Anonymous (24.16.xx.xx) on Wed 6 Aug 2008 at 01:50
Munin is very good, but not perfect. The plugin model works well. The server/UI could use some work. Here are some ideas I have been considering if it were to be re-engineered.

Use HTTP to send data from the client. Have the client save to a log file when the connection is down, and spawn a process to replay it when the connection is back up.
Store the data in CouchDB. You will probably need a proxy, since CouchDB writes are currently slow.
Write a CouchDB data explorer in JavaScript; with Dojo v1.2 it's practically done already.

Posted by Anonymous (217.91.xx.xx) on Fri 20 Jun 2008 at 09:02
Can someone please fix this article?

This is not valid HTML and the page looks awful:

<pre class="terminal">
<span class="prompt">jiu-jitsu:~#</span><span class="input"> cd /etc/munin/plugins/
ln -s /usr/share/munin/plugins/fw* .
<span>
</pre>
