Architectural Infrastructure and Reliability

I really like architecting highly available systems; I think they are the most architecturally demanding. In fact, serious performance and scalability issues often result in poor reliability or availability. The standard way to build a highly available system is to cluster; however, I prefer to start with a queued approach and then cluster as appropriate. The nice thing about this approach is that if something does fail, the system will restart without data or transaction loss.
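
To make the idea concrete, here is a minimal sketch of the store-and-forward pattern I mean, assuming a simple file-backed queue (the file name and the `enqueue`, `drain` and `send` functions are illustrative, not from any real system): each transaction is written durably before it is acknowledged, and removed only once the downstream system has confirmed it, so a crash or a dead link loses nothing.

```python
import json
import os

QUEUE_FILE = "outbound.queue"  # hypothetical path for the durable local queue


def enqueue(txn: dict) -> None:
    """Append a transaction to durable storage before acknowledging it,
    so a crash or link failure loses nothing that was accepted."""
    with open(QUEUE_FILE, "a") as f:
        f.write(json.dumps(txn) + "\n")
        f.flush()
        os.fsync(f.fileno())  # force the record to disk, not just the OS cache


def drain(send) -> None:
    """Forward queued transactions downstream, removing each one only
    after send() confirms the remote system has accepted it."""
    if not os.path.exists(QUEUE_FILE):
        return
    with open(QUEUE_FILE) as f:
        pending = [json.loads(line) for line in f if line.strip()]
    while pending and send(pending[0]):  # stop as soon as the link fails
        pending.pop(0)
    with open(QUEUE_FILE, "w") as f:  # rewrite whatever is still unsent
        for txn in pending:
            f.write(json.dumps(txn) + "\n")
```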

To give an example of this, the European Football system I talked about earlier was built as a transactional application, but of course the infrastructure it ran upon was critical to the reliability of the system. As I mentioned, there were about 10 servers per city and 10 cities in the system. Each server had dual LANs and was clustered; additionally, there were dual WANs to each city, and the central system (a Digital Alpha) was also clustered with a hot standby (it was a hub-and-spoke system).

Basically, everything was duplicated everywhere, so when I insisted on queues between all the systems the design team thought I was mad. I made lots of threats about what would happen to them if the system failed, and eventually they included queues between all the servers.

On the day of the cup final, when the systems had to be running, it was very stormy (nothing unusual for UK weather); however, we were more interested in the system's performance than in the match conditions.

Monitoring from the central hub, we could see the systems all running perfectly when we suddenly lost all the communications links! Panic! We picked up the phone, but that was out too. Using a mobile phone we got through to the telecom carrier, and it transpired that a lightning strike on the carrier's exchange in the central city had taken out all their lines. I didn't have a clustered exchange! I was so glad we had included the queues, as the remote cities continued to process independently, writing their transactions to the queues. Things were a bit tense as we phoned around the cities, monitoring the queue lengths and increasing queue capacity where necessary.
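
We were doing that queue watch by hand over the phone, but the job itself is simple to automate. A sketch, assuming the same file-backed queue as above (the depth threshold and the `grow_capacity` callback are illustrative):

```python
import time

DEPTH_LIMIT = 5000  # hypothetical backlog size at which we intervene


def queue_depth(path: str = "outbound.queue") -> int:
    """Count the transactions still waiting in the durable queue."""
    try:
        with open(path) as f:
            return sum(1 for line in f if line.strip())
    except FileNotFoundError:
        return 0


def watch(grow_capacity, interval: float = 30.0) -> None:
    """Poll the backlog during an outage and grow the queue's capacity
    (e.g. allocate more spool space) before it overflows."""
    while True:
        if queue_depth() > DEPTH_LIMIT:
            grow_capacity()  # hypothetical callback to add spool space
        time.sleep(interval)
```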

After a few minutes the exchange systems came back online and the queues started to flush. Immediately the central server went to 100% and stayed there for a couple of minutes whilst the queues all cleared down, which taught me very quickly the need for throttling in queued systems. Luckily, the central server kept running, and the system then settled down.
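
The throttling lesson amounts to bounding the rate at which a backlog is replayed. A minimal sketch, assuming a fixed transactions-per-second ceiling (the figure of 200 and the `process` callback are illustrative, not from the real system):

```python
import time

MAX_TXNS_PER_SEC = 200  # hypothetical rate the central server can absorb


def throttled_drain(pending, process) -> None:
    """Flush a backlog at a bounded rate so a recovering central server
    is not driven to saturation by every remote queue flushing at once."""
    interval = 1.0 / MAX_TXNS_PER_SEC
    for txn in pending:
        started = time.monotonic()
        process(txn)  # hand the transaction to the central server
        spent = time.monotonic() - started
        if spent < interval:
            time.sleep(interval - spent)  # pace the flush to the ceiling
```

A token bucket or a cap on in-flight messages would do the same job; the point is simply that recovery traffic must be paced rather than dumped on the recovering server all at once.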

So the nice thing about using queues is that they will keep your system running even when something unforeseen happens, but, as is the case in most failure analysis, it is the recovery that is the most difficult part of a high-availability system. Needless to say, I am a great fan of messaging and queues!