Monday, 30 June 2008

Soekris net4801 as Nagios powered network monitor

Some time ago (a rather long time ago) we decided to buy a small device and turn it into a very portable server that we could send to one of our friends to host. The whole purpose was to run Nagios on it and monitor our sites from outside our own networks. To some people that may sound crazy, but it makes sense - how many times have you heard someone say "it works on my computer"? Too many times?

The goal is to know when (and possibly why) my visitors/customers can't reach my servers, and to be able to tell whether the problem is local to one location or part of the network or affects a wider audience. Up to a point, a remote sensor answers that question - at least from the perspective of its particular location.

After looking around the net we decided to get one of those famous Soekris kits.

Was it a good choice as a hardware platform? How will it scale once the number of monitored systems reaches a certain level? Let's see where it has got us so far, as the system has been live for about a year now.

Under the bonnet
We got the Soekris net4801-60, which means our box has:
  • 266MHz Geode CPU
  • 256 MB SDRAM
  • 3x Ethernet
  • 2x Serial
  • 1x USB connector
  • 1x CF socket
  • 1x 44-pin IDE connector
  • 1x Mini-PCI socket
  • 1x PCI slot (3.3V)
Everything came nicely manufactured and put together - really nothing to complain about. At the back you can see the rest of the ports available. To be honest, I always have trouble telling which side is the back - the one with the nice LEDs or the one with all the Ethernet ports :-)

Disk storage

A printed manual came with the box and describes how to start using it. In fact, to do anything at all you have to open the case and put some storage inside. To build a system that we can move around freely without any risk of damage, I decided to go for a CF card as the storage medium and use the on-board CF slot. Of course you can also install a 2.5" HDD, but that requires an additional mounting bracket. Anyway, 2GB of space is more than enough for Linux!

Choosing operating system

I was struggling here for a long time... The manuals 'recommended' BSD, but because I didn't really have time to mess around, my choice was somewhere between Debian and Ubuntu, which apart from the name are almost the same. In the end I went for Ubuntu Server, which is clean and closed by default - if you don't tell it you want ssh to run, you won't have any ports open at all. I wanted something really minimalistic and I got it.

Installation procedure

Because I don't really have a desktop PC at home where I could plug in the CF card and install the OS, and I didn't want to do it in the office, I had to go for a network install. There are an awful lot of websites that describe how to do it, so I will skip this part here - all I can tell you is that it takes a very long time... simply because CF cards are not the fastest medium to write to. One thing you have to keep in mind - choose the i386 version of the kernel or it won't boot! By default Ubuntu installs an i686-optimized kernel, if I remember well.

System hacks/optimizations

First of all - CF cards can be a real pain in the back! Usually they don't play well with DMA - having DMA enabled usually (in my experience) causes the system to freeze for over a minute during boot. Another problem is wear - CF cards, like every other kind of flash memory, have a limited number of write cycles. Of course there are wear-leveling mechanisms in place to make sure the blocks get written more or less equally, but it is still a problem, and by default Linux is not very flash friendly (and as you will see, Nagios is even worse).

The first thing I did was add some mount options in /etc/fstab (see man mount for details):
  • commit=300
  • noatime
  • nodiratime
After that I kept an eye on the iostat results. Those parameters made a visible difference, so I left them in this form - I had no more time to experiment and had to get Nagios running. Anyway, I'm a bit concerned about the results in the long run...
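For illustration, this is roughly what a root filesystem entry in /etc/fstab looks like with those options applied. The device name /dev/hdb1 and the ext3 filesystem type are assumptions made for this sketch (hdb is the disk that shows up in the iostat listing; commit= is an ext3 journalling option), so adjust both to your own setup:

```
# <file system>  <mount point>  <type>  <options>                               <dump>  <pass>
/dev/hdb1        /              ext3    defaults,noatime,nodiratime,commit=300  0       1
```

commit=300 tells ext3 to flush data and metadata every 300 seconds instead of the default 5, so up to five minutes of writes can be lost on a power cut. noatime stops the access time being updated on every read; it already covers directories too, so nodiratime is harmless but redundant.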

# uptime
 08:21:49 up 63 days, 17 min,  1 user,  load average: 0.59, 0.50, 0.26

# iostat
Linux 2.6.22-14-386 (sar)       30/06/08

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.69    6.60    5.06    0.03    0.00   86.62

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
hdb               0.13         0.15         5.33     821860   29026872

As you can see, blocks are still written way too often - but now I know where it comes from :-)
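To put those numbers in perspective, here is a small back-of-the-envelope calculation in Python. It only uses the figures from the listing above, plus the fact that iostat counts "blocks" as 512-byte sectors on 2.6 kernels. Treat the result as logical writes only: how much the flash cells really wear depends on the card's controller, which this sketch knows nothing about.

```python
SECTOR_BYTES = 512  # iostat "blocks" are 512-byte sectors on 2.6 kernels

blocks_written = 29026872               # Blk_wrtn for hdb in the listing
uptime_seconds = 63 * 86400 + 17 * 60   # "up 63 days, 17 min"

# Total volume written since boot, and the averages derived from it.
bytes_written = blocks_written * SECTOR_BYTES
blocks_per_second = blocks_written / uptime_seconds
gb_per_day = bytes_written / 1e9 / (uptime_seconds / 86400)

print(f"total written: {bytes_written / 1e9:.2f} GB")
print(f"average rate:  {blocks_per_second:.2f} blocks/s")
print(f"per day:       {gb_per_day:.3f} GB")
```

The average rate comes out at the same 5.33 blocks/s that iostat reports, which is a nice sanity check: the counters are totals since boot. In total that is close to 15 GB written to a 2GB card in two months.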

Nagios - powerful monitoring software

Installing Nagios was easy - apt-get it and you are done. Configuration is another matter. Better read the documentation or a book about Nagios first. It is so flexible that if you don't get a grasp of the basic concepts, you will get yourself into trouble well before you finish describing your networks.

A few hints on setting it all up - the ones that took me some time to learn from my own mistakes:
  1. The most important part, which I didn't do properly... put all frequently rewritten files into RAM instead of a normal directory on the CF card! Every test or plugin run writes its output to a file (ouch!), and every state update is written to a file as well - that will wear out the CF card far too fast!
  2. Pick reasonable intervals for tests - that will affect overall performance a lot - more on that a bit later.
  3. Point status_file, object_cache_file and temp_file (this one causes a lot of I/O operations) at an in-memory filesystem such as /dev/shm/
I didn't do that properly, so you can see the wear on the CF card in the listing above. We'll see how it works now... it should also improve overall performance. Reads from CF are OK, but writes are rather slow, and with over 40 servers and well over 100 services monitored that leads to plugin timeouts and false positives. I often got emails and texts saying that there was a problem somewhere, when all that had happened was a plugin timeout.
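As a sketch, hint 3 boils down to three lines in the main Nagios configuration file (nagios.cfg). The directive names are the ones mentioned above; the /dev/shm/nagios directory and the file names are assumptions chosen for this example:

```
# nagios.cfg - keep the constantly rewritten files in RAM
# (the /dev/shm/nagios path is an assumption, pick your own)
status_file=/dev/shm/nagios/status.dat
object_cache_file=/dev/shm/nagios/objects.cache
temp_file=/dev/shm/nagios/nagios.tmp
```

Keep in mind that /dev/shm is empty after every reboot, so the directory has to be created and made writable for the Nagios user before the daemon starts, for example from the init script.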

SMS - text messages that will get you... wherever you are!

In my opinion the most reliable way of alerting staff about problems is text messaging, or SMS as it's known in other countries. You can integrate it with Nagios very easily by writing your own commands. All you need to do is find a provider that gives you an API for sending texts, and to be honest that's terribly easy! I found a company we can use on a pay-as-you-go basis and we bought a small package of texts. Then I wrote a simple Perl script that uses their API to send texts, followed by a Java app that did the same (just for fun, to see how hard it would be in Java - ehmmm... that's for another post), and in the last round I ended up using one of the native Nagios test plugins to send texts with an HTTPS call. That is the flexibility I was talking about!

BTW - make sure you tune it well, otherwise people will get annoyed with the texts and start ignoring them, and then they will surely ignore the important one that is a real alert.
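To give an idea of how little code such a script needs, here is a minimal sketch in Python (not the Perl or Java versions mentioned above). Everything provider-specific is made up: the URL sms.example.com and the field names key, to and text are placeholders, so check your provider's API documentation for the real ones.

```python
import urllib.parse
import urllib.request

# Hypothetical endpoint and field names: substitute your provider's real API.
SMS_API_URL = "https://sms.example.com/api/send"
MAX_SMS_CHARS = 160  # length of a single GSM text message


def build_sms_request(api_key, recipient, message, api_url=SMS_API_URL):
    """Build (but do not send) an HTTPS POST request for one text message."""
    fields = {
        "key": api_key,
        "to": recipient,
        "text": message[:MAX_SMS_CHARS],  # cut long alerts down to one text
    }
    body = urllib.parse.urlencode(fields).encode("ascii")
    return urllib.request.Request(api_url, data=body, method="POST")


def send_sms(api_key, recipient, message, timeout=10):
    """Send the message; raises urllib.error.URLError on failure."""
    request = build_sms_request(api_key, recipient, message)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

A Nagios notification command would then call a small wrapper around send_sms, passing standard macros such as $CONTACTPAGER$ for the phone number and a short message built from $HOSTNAME$, $SERVICEDESC$ and $SERVICESTATE$. Cutting the message at 160 characters keeps every alert to a single text, which matters on a pay-as-you-go package.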

Other ideas on how to use net4801

Just a few ideas I have:
  • Install a wi-fi card (Atheros based) and run Mikrotik on it - although you can get a RouterBoard with Mikrotik much cheaper (I have one myself).
  • Turn it into a remote 'light-bulb type' wi-fi sensor - for example Kismet drone
  • Build a portable firewall - let it be a transparent bridge with NAT-like behavior...
  • In-car computer system? Why not - it's a PC running Linux/BSD system, so you can build what you want, especially because it can take 6-20V DC max 15W
  • This book will give you some more ideas... and show a few more small form factor platforms to play with :-)

My overall opinion about net4801

It is a mid-range box, nothing too special, nothing as magic as one might think. It's really reliable, small and quiet! Great stuff to mess around with, play with software, use for various projects. I currently use it for Nagios and I'm quite happy with it, even though tuning it properly gave me a lot of headaches (more Nagios than Soekris). Finally I have a bit of time to build a second Nagios host in the office, so this one will go to another location as a remote sensor.

The bottom line is that the net4801 is not a speed demon, so before you buy it, make sure you have at least a rough estimate of the load your installation will have to handle, or you may end up with quite an expensive toy that you can't really use to solve your problem. I'm really happy with the one I have, and now I know for sure that the problems I had came from improper (suboptimal) use of the CF card for storage. I will play with the Nagios parameters a bit more soon and possibly swap the CF for an HDD - just to see how it compares. I bet the Soekris can do much more than what I get out of it now :-)

Have fun with your projects!