I have been thinking again about the writing course I was on the other day, where we were told that one of the most important things you can do to make writing more understandable is to simplify things. I think that is a good principle in general; I have been to architectural reviews where the proposed solution was so complex I doubt anyone could understand it.
I learnt about the complexity of systems a long time ago, when I was heading up the hardware design team for a very advanced mainframe-attached parallel processor for very compute-intensive applications. My team completed the hardware design and got it working on schedule and under budget, but the software team doing the OS were miles adrift. Eventually they got the code to work, but it was huge: some 100K lines on each of the 64 processors (in those days that was a big program!). I have to say that they were very smart guys and good programmers, but the OS architecture and implementation did seem overly complex to me.
Anyway, we went into integration test and the system kept crashing the mainframe. On investigation we found that it was running so slowly that sometimes the mainframe timed out. I was told it was my fault and to improve the hardware performance, so I wrung every nanosecond out of the design and doubled the hardware speed, but it was nowhere near enough. Everyone kept saying it was the hardware causing the problem, so in the end I wrote a hardware test harness which just ran the mainframe interfaces. This ran for a week without a failure, so I pointed out it must be the software. I must admit to a certain amount of gloating over this; after all, I had been accused of being the bad guy for about six months. Sure enough, the test shut the software guys up a treat; however, I hadn’t thought through all the ramifications.
About a week after the completion of the hardware verification test I caught flu and was at home in bed feeling dreadful when the phone went. It was my boss, to say that a very important customer to whom the system had been promised had phoned up our VP and given him a roasting about non-delivery. This of course had been passed down the management chain at high speed until it reached my boss, who decided to fire the software guys and get me to write the code instead! After all, I had managed to get it to work in test. I pointed out that I had just written a hardware test harness, not a complete multiprocessor, multichannel, parallel, high-performance operating system. That argument rather went over his head and I got told to get a team together and write the OS in a month! The original estimate had been six months, and the experienced software team of ten had been at it for over a year, so it seemed like a big challenge to me, especially as I didn’t have any programmers and did have flu. Life can be so like Dilbert.
Anyway, I dragged myself into work and got hold of a couple of interns who had been doing some of the hardware test coding for me. I didn’t tell them the magnitude of what we were trying to do but explained the architecture of the system. I also got a couple of programmers from the customer who had been doing acceptance testing and roped them in too. We set to work with a beta date two weeks later. After two weeks of intense coding we had the code finished and fired it up. Needless to say, it hung big time. Debugging OSs, especially multiprocessor ones, is very difficult; every time you try to measure something it changes the dynamics of the system and so the failure mode. In the end we spent three intense days with the listings, a whiteboard and a lot of coffee, and finally cracked the problem. Back to the system, reload the code, and it all worked brilliantly. We just had one small glitch that made the system hang very occasionally, but we knew what that was and how to fix it.
Time was up, however, and we had to go into customer acceptance test. Sod’s law cut in and the glitch showed up at just the wrong moment. However, because we had the customer’s programmers on the team and they knew we had a handle on the problem, we got a sign-off to ship.
There was great relief and celebration all round. The interesting thing was that the code was one tenth the size of the original software and ran 100 times faster; it was just much simpler. We fixed the final glitch and installed the processor in the customer’s DP centre. My boss got a big promotion and I got a certificate. Someone please tell me this doesn’t always happen!
An interesting footnote to this story happened five years later when, out of the blue, I got a letter from the customer. They had been keeping availability statistics for all their IT equipment, and in the whole of the five years my system had not crashed once, giving over 99.999999% (8 nines) availability. It was the most reliable system they had (including the mainframes) and they wanted to give me an award. When my management heard about this they asked me to write a white paper about architecting highly available multiprocessor operating systems. It was a very short white paper; it just said:
Keep it Simple, Stupid.
Of course it never got published; it was far too simple!