(Asynchronously) revisiting Munin

Server and service monitoring continues to be a good thing. I've had a little Munin installation running for about nine months now, which has proven several times to be an exceptionally useful resource, especially when trying to work out if and why something is behaving differently from the way it did last week.

Having had a bit of spare time recently, I revisited my deployment, and have made some tweaks to the way I have it set up.

Previously...

My Munin deployment was set up using a variation on the "traditional" architecture: I have a designated data collector host which periodically polls each of the monitored hosts, triggering a daemon (munin-node) on the remote side to run a series of preconfigured plugin scripts which generate metrics data. The default is to pull the metrics data over a plain old TCP connection, but I set my own instance up to pull it over SSH tunnels instead.
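For reference, a host entry on the collector using the stock TCP transport looks something like this (the hostname and address are placeholders rather than anything lifted from my own configuration):

[webserver.example.com]
    address 192.0.2.10
    use_node_name yes

munin-node listens on TCP port 4949 by default, which is where the 4949 in my SSH-tunnelled variant further down comes from.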

The way I set things up back in March still works fine. However, I haven't yet managed to shake my habit of accumulating more computers (virtualised or otherwise), so it's possible that I might eventually start to have scaling issues with my Munin instance. Munin's data collection script, munin-update, runs once every five minutes from cron(8); if collecting the data from all the monitored hosts and regenerating the graph images and HTML pages takes longer than that, multiple instances of the update can end up running concurrently, racing with each other to write data to disk.
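On Debian-style packaging, that schedule comes from a cron fragment along these lines (paraphrased; the exact wrapper varies between versions):

# cat /etc/cron.d/munin
*/5 * * * *     munin if [ -x /usr/bin/munin-cron ]; then /usr/bin/munin-cron; fi

munin-cron is a small wrapper which runs munin-update and then the limit checks and graph/HTML generation, so a slow update delays everything downstream of it.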

As it happens, I had some problems with this exact situation last week, when another guest on the same hypervisor as my data collector host was hogging the I/O time, slowing the collector's disk writes enough that munin-update needed more than five minutes to run to completion. I decided that re-engineering my deployment a little might make it more robust if this kind of situation happens again in the future.

One particular part of the data collection process which can contribute to this problem is that the data serialisation performed by each host's munin-node instance is synchronous with munin-update running on the collector host. munin-node only runs the monitoring scripts when it's polled over the network, so for every host it's gathering data from, munin-update has to wait for the remote side to run all the monitoring scripts in turn and serialise the results before sending them over the network. Munin's default configuration polls multiple remote hosts in parallel, which makes this much less of a problem in practice, but it doesn't eliminate it entirely, as I found out the hard way.
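You can watch this serialisation happen by speaking the node protocol by hand; each fetch blocks until the corresponding plugin has finished running (output abridged and illustrative):

# nc localhost 4949
# munin node at webserver.example.com
list
cpu df load memory processes swap uptime
fetch load
load.value 0.15
.
quit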

A different take on data fetching

The Munin devs have already thought of one potential way around this, and there's a mechanism to decouple the data generation and the data fetching, by means of an "asynchronous proxy node". The idea here is to add an intermediate component between the munin-node process performing the data generation and munin-update periodically fetching that data.

There are two parts to this: first, there's munin-asyncd, a daemon which runs on the monitored host alongside munin-node. It periodically polls the local munin-node instance, and caches the generated data on disk. Then, when the collector periodically tries to retrieve data from that host, instead of connecting to munin-node directly, it invokes munin-async, which replays the cached data back to the collector.

This means that the data collector no longer needs to wait for data to be serialised every time it polls, as the waiting is performed by munin-asyncd. munin-async is also specifically intended to be accessed over SSH, and invoked using an SSH forced command, which makes it ideal for my previously stated requirement of sending metrics data over a secure channel.
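As I understand the protocol, the upshot is that the collector's side of a poll collapses from one blocking fetch per plugin into a single bulk request that replays the spool, schematically something like this (not a verbatim capture):

spoolfetch <epoch of last successful poll>
...every sample cached since that point, replayed in one go...
.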

Putting this into practice

So I've decided to convert my Munin setup to using the asynchronous proxy, to reduce the time spent by the collector waiting on remote hosts running scripts and serialising data.

On each of the hosts being monitored, I've installed the munin-async package, which contains both the proxy daemon and the data fetching script.

# apt-get install munin-async
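To confirm the proxy daemon is up and has somewhere to spool to, a quick look at the process and its spool directory does the trick (the process name and path here are the Debian defaults, as far as I can tell):

# pgrep -af munin-asyncd
# ls /var/lib/munin-async/

After it's been running for a polling cycle or two, the spool directory should start filling up with cached sample files.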

The package's post-install hooks automatically start and enable the proxy daemon. I've then set up the munin-async user's SSH configuration to run the data fetching script on remote connections from the data collector.

# install -d -o munin-async -g munin-async /var/lib/munin-async/.ssh
# cat > /var/lib/munin-async/.ssh/authorized_keys <<EOF
> restrict,command="/usr/share/munin/munin-async --spoolfetch" ssh-ed25519 AAAC...
> EOF
# chown munin-async:munin-async /var/lib/munin-async/.ssh/authorized_keys

The authorized_keys configuration here is a little simpler than the way I had this previously set up, as there's no need to permit port forwarding and configure which ports may be reached by the remote connection.
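For comparison, a tunnel-based entry has to explicitly constrain what the key is allowed to forward to, along these lines (a reconstruction of the general shape, not a copy of my old configuration):

no-pty,no-agent-forwarding,no-X11-forwarding,permitopen="localhost:4949" ssh-ed25519 AAAC...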

At this point, I can get rid of the old munin-access user I was previously using for remote access.

# userdel -rf munin-access

Now, I need to reconfigure Munin on the data collector, so that it makes connections to the munin-async user instead.

# cd /etc/munin/munin-conf.d
# cat webserver.conf
[webserver.example.com]
    address ssh://munin-access@webserver.example.com -W localhost:4949
# cat > webserver.conf <<EOF
> [webserver.example.com]
>     address ssh://munin-async@webserver.example.com
> EOF
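Before waiting on cron, it's worth testing the connection by hand as the user munin-update runs as; if the forced command is wired up correctly, you should be greeted with a node banner rather than a shell (exchange abridged):

# sudo -u munin ssh munin-async@webserver.example.com
# munin node at webserver.example.com
quit

Once that works, the next munin-update run should start pulling the spooled data with no further changes.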

And that's it

It's a small tweak, but it seems to make quite a difference. Eyeballing the self-referential graphs Munin generates about how long each data collection/processing cycle takes, I'd say separating the data generation from the data collection has cut the processing time to roughly a sixth of what it was under normal load.

