In my last blog I hypothesised that architectural analysis is slightly different from developer analysis and so needs a subtly different skill set and way of thinking. To demonstrate what I mean, let me describe a real-life example of an architectural problem and different solutions.
I was called up late one Friday afternoon (why is it always on a Friday that these things happen?) by a distraught business manager whose biggest customer had a problem (I won’t say who it was but they are a household name in the
When I got to the customer a disaster scene met my eyes: paperwork everywhere, empty coffee cups, red-eyed technical people, irascible managers, phones ringing; you know the sort of thing. The technical people just wanted me to say it was our technology’s fault so they could go home. The managers wanted to nail me to a whiteboard and take turns with the whip. Fun all around. However, as I am not into S&M, I insisted on looking at the application first (so maybe I am, just not that sort!).
It was a simple 3-tier app: a smart client, a business tier doing some business processing, and a database with some simple stored procedures isolating the data access; nice and simple. There was, however, one strange thing: they had a second server running a piece of the business logic alongside the main business server. I asked why this was and it transpired that they had profiled the application (nice but unusual in my experience) and found that one piece of code, which did some simple customer validation and generated a GUID, was taking about 30% of the CPU. They were concerned that it would become a bottleneck, so they had come up with the idea that, as the application was very well modularised, they could put that code on a second server and so distribute the load. They knew all about scale out.
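To make that set-up concrete, here is a minimal Python sketch of what "well modularised" might look like. Everything in it is an assumption made for illustration: the names `CustomerValidator`, `LocalValidator`, `RemoteValidator` and `process_order` are hypothetical, the validation rule is invented, and the customer's real code (and its language) is not shown in this post. The point is only that the business tier depends on an interface, so the same validation step can run in-process or be forwarded to a second server.

```python
import uuid
from typing import Callable, Protocol


class CustomerValidator(Protocol):
    """Hypothetical interface the business tier depends on."""

    def validate(self, customer_name: str) -> str: ...


class LocalValidator:
    """Runs the validation in-process on the main business server."""

    def validate(self, customer_name: str) -> str:
        # Invented rule standing in for the "simple customer validation".
        if not customer_name.strip():
            raise ValueError("customer name must not be empty")
        # Generate the identifier, as the profiled code did with its GUID.
        return str(uuid.uuid4())


class RemoteValidator:
    """Forwards every call to a second server via a transport function."""

    def __init__(self, send: Callable[[str], str]) -> None:
        # `send` stands in for whatever remoting mechanism is used.
        self._send = send

    def validate(self, customer_name: str) -> str:
        return self._send(customer_name)


def process_order(validator: CustomerValidator, customer_name: str) -> str:
    """Business-tier code; unaware of where the validation actually runs."""
    order_id = validator.validate(customer_name)
    return f"order {order_id} accepted for {customer_name}"
```

For example, `process_order(LocalValidator(), "Acme")` keeps the work on one box, while `process_order(RemoteValidator(send=LocalValidator().validate), "Acme")` simulates the two-server deployment, with the in-process call standing in for the network hop.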
The problem was that when the load got to 50 users the network stack on the server overflowed and the system crashed. They had been on to product support and got patches to increase the network stack size (something I didn’t even know you could do!) but of course that didn’t fix the problem. Because it seemed to be something in the network layer they had spent ages on network tuning, putting in faster Ethernet, hubs and so on. They were now convinced that it was an OS problem and that Windows wasn’t scalable, so why didn’t I admit it and let the blame fall on MS?
This is not a great career move at Microsoft and anyway I thought I knew what the problem was. I suggested a quick rebuild of the application with a simple change and then a retest whilst I went and moved the car (I had left it on double yellows). By the time I got back they had done the modifications, stress tested and were able to meet the 200 user criteria easily (which either shows how productive our platform is or how difficult it is to find a parking place in the
So four questions:
1 What was causing the problem?
2 What was the fix?
3 How should it have been architected for scalability in the first place?
4 Why do I hate marketing messages?
Answers in the feedback