Monday, 7 July 2008

Tuning Nagios to run off a CF card

As a follow-up to my previous post, which described the platform and what can be done with it, I have been running my Nagios installation on a Soekris net4801 with the advice from that post applied (the focus being slow I/O when writing to the CF card). The change in system behaviour is huge - in a positive way, of course.

First of all, the system is no longer overloaded, and I guess I could double the number of tests run on this platform without getting into trouble like before. At the moment it monitors 36 machines with 86 services in total. Some time ago I had to stop adding tests and actually remove some less important ones, because most of the time I was getting false positives: usually warnings with a comment that the plugin had timed out. So how big is the difference?

Quite substantial!

Those values and their meaning are described in the Nagios documentation, and for more detailed stats you can simply click on the 'Monitoring Performance' text to get something similar to the screen below.

This covers active service checks only because, as you can see from the first screen, I don't use passive checks at all in this setup, and host checks play a really minor role here with an average check time of 0.15 sec.

Tuning Nagios for speed and 'CF Card friendliness'

In fact I have changed only one thing: I moved the Nagios status file to /dev/shm, which is an in-memory file system, just to reduce the number of physical writes to the CF card. This file is the most frequently written one, so a change like that should have a huge effect... and here we go, a bit over a week after changing the settings:

$ iostat
Linux 2.6.22-14-386 (sar)       08/07/08

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.81    5.29    4.57    0.00    0.00   88.32

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
hdb               0.02         0.28         0.19     167644     116904

To be honest - that is something! Writes were cut from 5.33 blocks/sec to 0.19 blocks/sec, which means the amount of data written dropped by a factor of more than 28!
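As a quick check of that arithmetic, here is a small sketch using awk. The function name write_reduction is made up for this example; the two figures are the Blk_wrtn/s values quoted above.

```shell
# Hypothetical helper: compare two Blk_wrtn/s readings taken from iostat.
# $1 = blocks written per second before the change, $2 = after the change.
write_reduction() {
    awk -v before="$1" -v after="$2" 'BEGIN {
        # Ratio of old to new write rate, plus the same thing as a percentage.
        printf "%.1fx fewer writes, %.1f%% reduction\n",
               before / after, (1 - after / before) * 100
    }'
}

write_reduction 5.33 0.19
# prints "28.1x fewer writes, 96.4% reduction"
```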

For comparison - Nagios results before the change

Before the change the minimum execution time of an active service check was about 2.2 sec, the maximum reached over 12.1 sec (in my setup anything longer than 10 sec gives a warning - there is not much point in waiting longer than 10 sec to see if a ping or other test comes back), and the average run time floated around 6-8 sec depending on the current situation.

It is worth pointing out that Nagios has a smart load-levelling algorithm that spreads tests over time to reduce the load on the machine. It works, but sometimes long-running tests piled up, and that caused timeouts even on tasks as simple as a ping sweep. Another interesting thing: if you don't really need the full ping test, change the command that does the pinging to use the check_fping plugin instead of check_ping. This way I reduced the test time from an average of 4.5 sec to well under 0.3 sec, and a change of that size cuts out most timeout warnings even when the tests happen to run at the same time. The check times described above are for a system that already runs check_fping.
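A minimal sketch of what such a command definition could look like, written out from the shell. The helper name write_fping_command, the command name check-host-fping, the output file and the thresholds are all assumptions made for illustration, not necessarily the values used on this box; $USER1$ and $HOSTADDRESS$ are standard Nagios macros.

```shell
# Hypothetical helper: write an example check_fping command definition
# to the file given as $1 (for instance a file in your Nagios config directory).
write_fping_command() {
    cat > "$1" <<'EOF'
# Thresholds are <round-trip time in ms>,<packet loss>% and are illustrative only.
define command{
        command_name    check-host-fping
        command_line    $USER1$/check_fping -H $HOSTADDRESS$ -w 100,20% -c 500,60%
        }
EOF
}

# Example: write the definition to a scratch file for review.
write_fping_command "${TMPDIR:-/tmp}/fping_command.cfg"
```

Any host or service that should use it would then refer to check-host-fping in its check_command, and Nagios needs a reload to pick up the new definition.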

Summary - what do those numbers really mean?

Well - numbers are just numbers, but in plain English they mean that I no longer get plugin timeout messages. I haven't seen any timeouts since the status file was moved to /dev/shm/, and that is great. No false positives, the system responds faster... and you know what? Several new servers and plenty of new services are coming and will be added to this Nagios installation, so I'll let you know if/when I get into trouble again.
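For anyone who wants to repeat the change, here is a minimal sketch of the status file move. It assumes the main config sets the location with the standard status_file directive; the helper name and the paths in the usage line are only examples, so check them against your own install. Nagios has to be restarted afterwards, and since /dev/shm is emptied on reboot, the file should simply be recreated the next time Nagios starts.

```shell
# Hypothetical helper: point status_file in a Nagios main config at a new path.
# $1 = path to nagios.cfg (or a copy of it), $2 = new status file location.
set_status_file() {
    cfg="$1"
    new_path="$2"
    tmp="$(mktemp)" || return 1
    # Rewrite only the status_file line; every other line is copied unchanged.
    sed "s|^status_file=.*|status_file=${new_path}|" "$cfg" > "$tmp" &&
        cat "$tmp" > "$cfg"
    rm -f "$tmp"
}

# Example (paths are assumptions, adjust to your install):
# set_status_file /etc/nagios/nagios.cfg /dev/shm/status.dat
```

Trying it on a copy of the config first is the safer route, since a typo in the path will stop the web interface from showing any status at all.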