Steering clear of terminology like HPC/cluster/cloud/grid and the precise meanings of those terms, I use the word "cloud" because I think it's the closest to what I've got in my prototype - it's still a work in progress and may get even more "cloudy" or change shape along the way. There won't be any code this time - maybe when I finish it properly and have some proper performance stats - for now it's just a running and usable PoC that I describe here :-)
Problem to solve
There is a significant set of problems in IT that can be solved by parallel computing: data indexing, various conversions, compression, password guessing, etc. Mine today is testing several implementations of a specific statistical model - to keep it simple, think of it as a mathematical game to be played. I don't have to "play" those games on one computer - I need statistics over N runs, where N is a really large number. Splitting the runs across several computers, playing, collecting the results and then merging them - that's exactly what I need!
Don't reinvent the wheel - use MapReduce
The MapReduce algorithm is well known and widely used. For example, Google uses it for search-engine indexing and many other tasks, and judging from the documents they have published on the web, they have implemented it very well. Luckily my problem doesn't require large quantities of data or storage (at least so far)... All I need for this task is plain CPU power - the more the better!
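To show how the N game runs fit the MapReduce shape, here is an illustrative sketch (my real code is Perl; `play_game` and the batch size are hypothetical stand-ins for the actual model):

```python
# Sketch: N game runs split into work units (map), partial statistics
# merged into a final tally (reduce). play_game() is a stand-in.
from collections import Counter

def play_game(seed):
    # Stand-in for one run of the statistical model: return an outcome label.
    return "win" if seed % 3 == 0 else "loss"

def map_unit(seeds):
    # One work unit: play a batch of games, emit partial statistics.
    return Counter(play_game(s) for s in seeds)

def reduce_units(partials):
    # Merge partial statistics from all work units into the final tally.
    total = Counter()
    for p in partials:
        total.update(p)
    return total

# Split N runs into work units of 1000 games each, "send" them out, merge.
N = 10_000
units = [range(i, min(i + 1000, N)) for i in range(0, N, 1000)]
stats = reduce_units(map_unit(u) for u in units)
```

The nice property is that work units are independent, so they can be handed to any number of machines in any order.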
Some of the requirements are a must, some are "good to have", while others are left for version 2. Let's try to group them.
- Client/server architecture - a server to manage the tasks and scheduling, "light bulb" type clients to do the job
- Reliable way of transferring data between server and clients - that includes all code/binaries and data containing results in whatever form we might get them
- Ability to push my client to the worker node (installs and/or updates)
- Method to deliver game code to the client and send the results back to the server
- Some way to check how many work units are new/in progress/done
- The server has to be able to detect failed or overdue tasks in case the client failed for some reason (remember the "light bulb" approach)
- Support for multi-core CPUs and SMP architecture
- Ability to check client capabilities and group worker nodes by some criteria (CPU, RAM, disk, network, etc)
- Task assignment depending on node ID or capabilities
- Graceful node shutdown without interrupting tasks in progress (i.e. "when you finish currently running computations please shutdown")
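Several of the requirements above boil down to bookkeeping per work unit. One way to model it is a small state record - a sketch with hypothetical field names, not my actual schema:

```python
# Sketch of per-work-unit state supporting the requirements above:
# new/in-progress/done tracking, client assignment, and timeout detection.
import time
from dataclasses import dataclass

@dataclass
class WorkUnit:
    unit_id: int
    status: str = "new"          # new -> in_progress -> done (or back to new)
    client_id: str = ""          # which node took it (for failure analysis)
    started_at: float = 0.0      # when it was handed out

    def assign(self, client_id):
        self.status = "in_progress"
        self.client_id = client_id
        self.started_at = time.time()

    def complete(self):
        self.status = "done"

    def requeue(self):
        # Reset an overdue unit so another client can pick it up.
        self.status = "new"
        self.client_id = ""
        self.started_at = 0.0

u = WorkUnit(unit_id=42)
u.assign("node-ab12")
u.requeue()
```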
Fully usable (and useful) PoC
I used Perl to code everything. My design is far from perfect, but hey - it works for me!
The server turned out to be a simple CGI script, using only the standard Perl modules that ship with Perl on every modern Linux distro. I had to design a simple communications protocol - which parameters go where, how to pass data and results back and forth, how to store the data (there is no SQL database - I didn't see a need for one) and how to keep a full audit log of what happened.
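To illustrate the shape of that protocol (in Python rather than the real Perl CGI; the action names, fields and in-memory stores here are hypothetical), the server side boils down to a small request dispatcher plus an append-only audit log:

```python
# Sketch of the server-side protocol: dispatch on an "action" parameter,
# hand out the next new unit, record completions, audit everything.
import json, time

AUDIT_LOG = []          # in the real thing this would be a file on disk
QUEUE = [{"unit_id": 1, "status": "new"}, {"unit_id": 2, "status": "new"}]

def audit(event, **fields):
    AUDIT_LOG.append({"time": time.time(), "event": event, **fields})

def handle_request(params):
    action = params["action"]
    if action == "get_work":
        for unit in QUEUE:
            if unit["status"] == "new":
                unit["status"] = "in_progress"
                unit["client_id"] = params["client_id"]
                audit("assigned", unit_id=unit["unit_id"],
                      client_id=params["client_id"])
                return json.dumps(unit)
        return json.dumps({"unit_id": None})     # nothing left to do
    if action == "submit_result":
        for unit in QUEUE:
            if unit["unit_id"] == params["unit_id"]:
                unit["status"] = "done"
                audit("completed", unit_id=unit["unit_id"])
                return json.dumps({"ok": True})
    return json.dumps({"error": "unknown action"})

reply = handle_request({"action": "get_work", "client_id": "node-ab12"})
```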
The client is a simple HTTP(S) client - nothing special. It reads some hardware-specific information and uses it to generate a unique client ID, then checks how many cores the node has and forks one copy of itself for every core found; from there it's all pretty much standard serial processing within each process. Each client gets the relevant information about the game binary and runtime parameters from the server, fetches the binary if there is no local copy with a matching hash yet (simply to preserve bandwidth), runs the process, collects the results, sends them back to the server, and repeats. As simple as that - the client is just a local wrapper around the communication protocol plus a very simple command-execution module, nothing special or complicated. K.I.S.S.
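The per-core loop can be sketched like this (again Python rather than Perl; the in-memory "server" and the hash check stand in for the real HTTP(S) calls):

```python
# Sketch of the per-core client loop: fetch a unit, "download" the game
# binary only when the cached hash doesn't match, run it, report results.
queue = [{"unit_id": 1, "params": 7}, {"unit_id": 2, "params": 9}]
cached_hash = None            # no local copy of the binary yet
BINARY_HASH = "abc123"        # hash advertised by the server
results = {}

def get_work():
    return queue.pop(0) if queue else None

def run_game(params):
    return params * 2          # stand-in for executing the fetched binary

def client_loop():
    global cached_hash
    while (work := get_work()) is not None:
        if cached_hash != BINARY_HASH:
            cached_hash = BINARY_HASH      # fetch the binary just once
        results[work["unit_id"]] = run_game(work["params"])

client_loop()
```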
As you can see, coding something like that is not a big deal, but it takes a bit of time to plan properly. I've already fulfilled most of the requirements above - I use GET/POST mostly but could easily expand to PUT, WebDAV, REST or anything else; I have multi-core/SMP support where available and progress tracking on the server. When a client requests a new work unit, the server stores the time and the unique client ID, which can be used later to re-queue tasks that have failed. For that purpose I use simple timeouts: if a work unit is known to run on hardware of a given class for about 30 minutes and the results are not returned in, let's say, 90 minutes, it's rather safe to assume that the task died, the node died or something else went wrong, so we reset the unit's status and put it back in the queue to be picked up by another client. This proved very valuable in the initial phase of testing, when I had a group of tasks that failed... I could isolate the computers that ran the failed tasks and check whether the cause was in my code, the server code, or something machine-specific. For version 2 I could, for example, monitor failures and stop assigning new tasks to nodes that pass a given failure threshold, flagging them for maintenance.
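The re-queue sweep itself is tiny - a sketch, with the 3x multiplier matching the 30-to-90-minute rule of thumb above:

```python
# Sketch of the timeout-based re-queue sweep: an in-progress unit that
# hasn't reported back after 3x its expected runtime goes back in the queue.
import time

EXPECTED_RUNTIME = 30 * 60          # seconds, per hardware class
TIMEOUT = 3 * EXPECTED_RUNTIME      # 90 minutes

def requeue_overdue(units, now=None):
    now = now or time.time()
    reset = 0
    for u in units:
        if u["status"] == "in_progress" and now - u["started_at"] > TIMEOUT:
            u["status"] = "new"      # back in the queue for another client
            u["client_id"] = ""
            reset += 1
    return reset

now = 1_000_000
units = [
    {"status": "in_progress", "started_at": now - 100 * 60, "client_id": "a"},
    {"status": "in_progress", "started_at": now - 10 * 60, "client_id": "b"},
]
n = requeue_overdue(units, now=now)
```

Keeping the pre-reset `client_id` in the audit log is what makes the failure analysis mentioned above possible.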
Making nodes run my code
Obviously it's quite easy to run an experimental application if you know what the node has to offer in terms of libraries, tools, etc. To avoid the hassle of testing that, or of cleaning up after failed clients, I went for an even more "light bulb" approach - my own simple Linux Live CD/USB!
A very basic Debian Live image has almost everything I need, in fact - I just had to add an SSH server so I could shut the nodes down remotely. It boots nicely, copies the whole image to RAM if possible, sets up networking with DHCP, fetches the latest client binary (a packaged Perl script) with wget and executes it. Everything happens without my interaction: just turn the "light bulb" on and it will join the cloud. Storage is local, but we don't use any hard drives (in fact I've pulled them out entirely); there is no need to install and maintain an OS - just one CD image, a bootable USB stick, or even better (in version 2) a net-boot environment. Because it's a Live system I can use any PC around - be it a desktop in the office or a much more powerful server in the data centre - and I won't modify any OS already installed on the box (if there is one). It's really convenient and the least invasive it can get - the guys at Debian Live did a great job! Thank you!
Another bonus is that if I need more computing power, or need to finish my calculations sooner, I can always turn to commercial solutions - for example by turning my Debian Live image into an EC2 instance and paying per hour for CPU time. Simply put, if there is a business case to finish the job sooner, then the cost of the extra resources is justified and accepted.
How does it perform?
We had some old servers in the office, removed from one of our data centres. After a quick check it turned out that we have 10 machines with two dual-core CPUs each, at 2.8-3 GHz, which right away gives me 40 CPU cores to play with. I can't tell exactly what the speed benefit is, but it seems to work quite well for me. There has to be some penalty: if only one core is used (one work unit at a time) at 100% and the unit runs for 10 minutes, then with two units running - one on each core - I usually get a runtime of 10-12 minutes and get two units done. Those are very rough estimates and I haven't tried to verify them yet - I'll do that in version 3 one day.
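Those rough numbers imply a per-core efficiency figure - a quick back-of-the-envelope check:

```python
# Back-of-the-envelope check of the two-core penalty described above:
# 1 unit in 10 minutes on one core vs 2 units in 12 minutes on two cores.
single_core_rate = 1 / 10          # units per minute, one core busy
dual_core_rate = 2 / 12            # units per minute, both cores busy

speedup = dual_core_rate / single_core_rate       # ~1.67x
efficiency = speedup / 2                          # ~83% per core
```

So even at the pessimistic end (12 minutes per pair), the second core still adds roughly two-thirds of a core's worth of throughput.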
The most important thing is that we've tested one statistical model (several billion runs) in about 72 hours instead of 4-6 months - and that was with servers being randomly turned on and off, units being re-queued, etc.! Development and testing time was only about 2-3 days, including building and testing my Live CD.
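As a sanity check (assuming the 4-month figure means one core running flat out), the numbers line up nicely with the 40 cores:

```python
# Sanity check: ~4 months on a single core, divided across 40 cores.
single_core_hours = 4 * 30 * 24    # ~4 months of wall-clock time = 2880 h
cores = 40
ideal_parallel_hours = single_core_hours / cores
```

An ideal 40-way split of 4 months comes out at exactly 72 hours, so the observed 72-hour run (despite the re-queuing and node churn) is right in the expected range.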
What problems have I encountered so far?
First of all, I forgot how quickly a poorly optimised app can fail. 40+ clients can poll the server quite often (say, every minute), and if you don't use a DB to store state, then you have to do the locking yourself... or find a way to do it lock-free - use your brain!
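For example, concurrent updates to a file-based state store can be serialised with an exclusive advisory lock - a sketch using Python's `fcntl` (the real server is Perl, which has the equivalent `flock` built in; the file path is illustrative):

```python
# Sketch: serialise concurrent CGI requests updating a shared state file
# with an advisory lock, so 40+ polling clients can't corrupt it.
import fcntl, json, os

STATE_FILE = "/tmp/queue_state.json"

def update_state(mutate):
    # Create the file on first use so open("r+") below succeeds.
    if not os.path.exists(STATE_FILE):
        with open(STATE_FILE, "w") as f:
            json.dump({"next_unit": 0}, f)
    with open(STATE_FILE, "r+") as f:
        fcntl.flock(f, fcntl.LOCK_EX)      # block until we own the file
        state = json.load(f)
        mutate(state)                      # apply the change atomically
        f.seek(0)
        json.dump(state, f)
        f.truncate()
        return state
        # Lock is released when the file is closed.

def take_unit(state):
    state["next_unit"] += 1

# Demo: start from a clean slate, then hand out two units.
if os.path.exists(STATE_FILE):
    os.remove(STATE_FILE)
update_state(take_unit)
final = update_state(take_unit)
```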
Power and heat. When I plugged some of those servers into the same UPS we have for the comms room (a rather big one, but already used by one of the racks), it got overloaded... I was surprised how much energy you can save by sliding the HDDs out! Later I found a better way to power them up - we have separate, unused, heavy-duty power feeds at the bottom of the rack, so there we go. Then temperature became a problem. Ten 1RU servers idling in a rather small room generate plenty of heat - heat that our small AC unit can't cope with. I try to keep the temperature below 20 Celsius, with an absolute maximum of 22. With five servers idling we are at about 18.9-19 degrees, with seven boxes at 19.6, and with 10 we are well over 20. When they start working... oh my... oh my... we get to 26-27 in minutes! For long runs I have to limit myself to 5-6 servers at most. It seems adding thermal controls to the server code will be the very first new thing in version 2...
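A sketch of what those version-2 thermal controls could look like - hold back new work units once a temperature reading crosses a threshold (the limits match the numbers above; where the reading comes from is left open):

```python
# Sketch of version-2 thermal throttling: stop handing out new work
# units above the soft limit, treat the hard limit as an emergency stop.
SOFT_LIMIT = 20.0    # Celsius - stop assigning new units above this
HARD_LIMIT = 22.0    # Celsius - absolute maximum for the room

def may_assign_work(room_temp):
    return room_temp < SOFT_LIMIT

def must_shutdown(room_temp):
    return room_temp >= HARD_LIMIT

ok = may_assign_work(19.6)       # seven idle boxes: still fine
hot = may_assign_work(26.5)      # full load: hold new units back
```

Because clients already finish their current unit before asking for a new one, refusing `get_work` requests gives a graceful, gradual cool-down with no interrupted tasks.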
So that's all folks - at least for now. Back to the drawing board - I hope to have version 2 running by early May.