Architectural thinking

In my last blog post I hypothesised that architectural analysis is slightly different from developer analysis and so needs a subtly different skill set and way of thinking. To demonstrate what I mean, let me describe a real-life example of an architectural problem and different solutions.

I was called up late one Friday afternoon (why is it always on a Friday that these things happen?) by a distraught business manager whose biggest customer had a problem (I won’t say who it was, but they are a household name in the UK). They had decided to provide a new product over the telephone and so had built a customer / order processing system for a maximum of 200 telesales operatives using Microsoft products. They were going live on the Monday when the new product launched (and that wasn’t going to be easy to stop!). They had been in stress test for 3 weeks, and when they took the load up to 50 users the system crashed. They had tried to fix the problem and couldn’t, so it must be Microsoft’s fault; after all, they had read in the press that Windows didn’t scale and here was proof! The business manager wanted me at their office (a 3 hour drive) asap, not so much to fix the problem as to show that we were doing something. It seems to be a common belief that putting technical people in cars or trains is a valuable use of their time, which I vigorously dispute; I feel that most problems can be solved more quickly over the phone. There was a short discussion about efficient problem solving techniques, he spoke to my manager, and I was in the car. Why is my life so like a Dilbert cartoon?

When I got to the customer a disaster scene met my eyes: paperwork everywhere, empty coffee cups, red-eyed technical people, irascible managers, phones ringing; you know the sort of thing. The technical people just wanted me to say it was our technology’s fault so they could go home. Managers wanted to nail me to a whiteboard and take turns with the whip; fun all around. However, as I am not into S&M I insisted on looking at the application first (so maybe I am, just not that sort!).

It was a simple 3 tier app: a smart client, a business tier doing some business processing, and a database with some simple stored procedures isolating the data access; nice and simple. There was, however, one strange thing: they had a second server running a piece of the business logic alongside the main business server. I asked why this was, and it transpired that they had profiled the application (nice but unusual in my experience) and found that one piece of code, which was doing some simple customer validation and generating a GUID, was taking about 30% of the CPU. They were concerned that it would become a bottleneck, so had come up with the idea that, as the application was very well modularised, they could put that code on a second server and so distribute the load. They knew all about scale out.
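For readers who like to see things in code, here is a rough sketch of what "very well modularised" might have looked like. It is not the customer's code (which was built on Microsoft products, not Python); the names `CustomerValidator`, `LocalValidator`, `RemoteValidatorProxy` and `take_order`, and the validation rule itself, are all made up for illustration. The second server is simulated in-process by passing the request and reply through JSON, so no real network is involved.

```python
import json
import uuid
from typing import Protocol


class CustomerValidator(Protocol):
    """The seam in the business tier: callers only see this interface."""

    def validate(self, customer: dict) -> dict: ...


class LocalValidator:
    """Runs the validation in the same process as the business tier."""

    def validate(self, customer: dict) -> dict:
        # Hypothetical rule: a customer needs a non-empty name and postcode.
        ok = bool(customer.get("name")) and bool(customer.get("postcode"))
        # A GUID is generated for every customer that passes validation.
        return {"valid": ok, "guid": str(uuid.uuid4()) if ok else None}


class RemoteValidatorProxy:
    """Same interface, but each call is marshalled as if sent to a second server."""

    def __init__(self, server: LocalValidator) -> None:
        self._server = server  # stands in for the second machine (assumption)
        self.calls = 0         # how many requests have been sent so far

    def validate(self, customer: dict) -> dict:
        self.calls += 1
        request = json.dumps(customer)  # serialise the request
        reply = json.dumps(self._server.validate(json.loads(request)))
        return json.loads(reply)        # deserialise the response


def take_order(validator: CustomerValidator, customer: dict) -> str:
    """Business-tier code that neither knows nor cares where validation runs."""
    result = validator.validate(customer)
    if not result["valid"]:
        raise ValueError("customer failed validation")
    return result["guid"]
```

Because `take_order` depends only on the interface, moving the validation between the main business server and a second one is a one-line change at the point where the validator is constructed, which is what made the redeployment described above so easy to try.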

 

The problem was that when the load got to 50 users the network stack on the server overflowed and so the system crashed. They had been on to product support and got patches to increase the network stack size (something I didn’t even know you could do!) but of course that didn’t fix the problem. Because it seemed to be something in the network layer they had spent ages on network tuning, putting in faster Ethernet, hubs and so on. They were now convinced that it was an OS problem and that Windows wasn’t scalable, so why didn’t I admit it and let the blame fall on MS?

This is not a great career move at Microsoft, and anyway I thought I knew what the problem was. I suggested a quick rebuild of the application with a simple change and then a retest whilst I went and moved the car (I had left it on double yellows). By the time I got back they had done the modifications, stress tested and were able to meet the 200 user criterion easily (which either shows how productive our platform is or how difficult it is to find a parking place in the UK!). Congratulations all round: techies treating me like a guru, senior managers fetching me coffee and a much relieved business manager who carried my bag to the car. Sometimes I love this job!

So four questions:

1 What was causing the problem?

2 What was the fix?

3 How should it have been architected for scalability in the first place?

4 Why do I hate marketing messages?

Answers in the feedback