In a multi-component system, there may be cases where the system is only partially working. There are two basic classes of problems:
· Functional problems: One or more parts of the system are not providing the necessary functions. The problem is generally well defined and can be easily isolated, and the solution approach is directed toward dependency isolation and remediation. When a functional problem arises, troubleshooting should start with isolating the dependencies of the non-functional part. These may not be obvious at first: there may be hidden external dependencies that only manifest themselves when they stop functioning. This is one of the reasons why every component in a system should report its state changes to a central repository, together with a reason if possible. For example, when the Exchange store service stops, you see an entry in the event log. Searching earlier events, you also see that Windows is having problems accessing the volume where your Exchange databases reside. Now that you have isolated your dependencies, you can check the host connector cables, storage device connections, etc. However, most of the time the problem is not solved but converted to another problem class.
· Performance problems: The system is providing the necessary functions, but the performance is not as expected. These problems are generally much harder to find than functional problems. The primary reason is that a functional problem is simply a state change, whereas for a performance problem you need a history of the level of function according to given metrics. Most of the harder-to-detect problems start as performance problems and turn into functional problems, which are much more visible. However, due to operational constraints, root cause analysis is often not carried out: once the functional problems are identified they are converted back to performance problems, and the pressure to solve them drops. To identify performance problems you need historical data from all related systems, which you correlate to find a difference in performance and isolate the component(s) causing the problem. Sometimes this is easy, if your systems do not affect each other. However, if you have a highly available web site, you will need to check performance from your network links to load balancers to web servers to databases. To see what the problem is, you need performance counters or operational logs from all systems, with a common time source. These may be event log entries from your services, but they can also be much more detailed logs that you collect to see the inner workings of the service being provided. You may also need triggers to start and stop data collection, or you may end up with mountains of data to store and analyze. If one of the components cannot provide detailed logs, you may not be able to solve the problem. Next time you are buying a cheap switch or router, check its data collection and reporting capabilities and decide whether you want to take the operational risk with the system in question.
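The correlation step described for performance problems can be illustrated with a small sketch. It assumes every component reports a latency-like metric as (timestamp, value) pairs stamped from a common time source, and it compares each component's readings after a chosen incident time against its own earlier baseline. The function name, the component names, the sample values, and the threshold of three standard deviations are all hypothetical choices made for illustration; a real system would read its own counters or logs.

```python
from statistics import mean, stdev

def find_deviating_components(samples, incident_start, threshold=3.0):
    """Rank components whose metric moved away from its own baseline.

    samples: dict mapping a component name to a list of (timestamp, value)
             pairs, all stamped from the same clock (an assumption).
    incident_start: timestamp at which the slowdown was first noticed.
    threshold: how many baseline standard deviations count as a deviation.
    """
    suspects = []
    for component, series in samples.items():
        baseline = [value for ts, value in series if ts < incident_start]
        incident = [value for ts, value in series if ts >= incident_start]
        # Without history there is nothing to compare against.
        if len(baseline) < 2 or not incident:
            continue
        # Guard against a perfectly flat baseline (standard deviation of 0).
        spread = max(stdev(baseline), 1e-9)
        score = (mean(incident) - mean(baseline)) / spread
        if abs(score) > threshold:
            suspects.append((component, score))
    # Largest deviation first.
    suspects.sort(key=lambda item: abs(item[1]), reverse=True)
    return suspects

# Hypothetical response times in milliseconds, sampled every 60 seconds.
SAMPLES = {
    "load_balancer": [(0, 5.0), (60, 5.2), (120, 4.9), (180, 5.1), (240, 5.0)],
    "web_server": [(0, 40.0), (60, 42.0), (120, 41.0), (180, 43.0), (240, 41.5)],
    "database": [(0, 10.0), (60, 11.0), (120, 10.5), (180, 55.0), (240, 60.0)],
}

print(find_deviating_components(SAMPLES, incident_start=180))
```

With this made-up data only the database stands out, because the other two components stay close to their own baselines. The comparison only means something if the timestamps are comparable across components, which is why the common time source matters.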
When you are designing a multi-layered service, you need data collection mechanisms with a common time source that can be triggered by events and that retain data long enough to show the course of events leading up to a performance problem. This makes it much easier to track down performance problems when you are faced with hard-to-solve issues.
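One possible way to get triggered collection without storing mountains of data is to keep only a short rolling history in memory and persist it when a trigger fires. The sketch below is a minimal illustration under stated assumptions: the class name `TriggeredCollector`, its parameters, and the 100 ms trigger condition are hypothetical, and the timestamps are assumed to come from a clock that is synchronized across all components.

```python
from collections import deque

class TriggeredCollector:
    """Keep a short rolling history; capture it only when a trigger fires."""

    def __init__(self, history, trigger, post_samples=0):
        self._recent = deque(maxlen=history)  # oldest samples fall off
        self._trigger = trigger               # callable: value -> bool
        self._post_samples = post_samples     # samples to keep after a trigger
        self._remaining = 0                   # post-trigger samples still due
        self.captured = []                    # stands in for durable storage

    def record(self, timestamp, value):
        sample = (timestamp, value)
        if self._remaining > 0:
            # Still inside a capture window: store the sample directly.
            self.captured.append(sample)
            if self._trigger(value):
                # The condition persists, so extend the window.
                self._remaining = self._post_samples
            else:
                self._remaining -= 1
            return
        self._recent.append(sample)
        if self._trigger(value):
            # Persist the lead-up to the event, then start the window.
            self.captured.extend(self._recent)
            self._recent.clear()
            self._remaining = self._post_samples

# Hypothetical latency samples (timestamp in seconds, value in milliseconds).
collector = TriggeredCollector(history=3, trigger=lambda ms: ms > 100,
                               post_samples=2)
for ts, ms in [(0, 20), (1, 22), (2, 21), (3, 23), (4, 150),
               (5, 160), (6, 30), (7, 25), (8, 24)]:
    collector.record(ts, ms)

print(collector.captured)
```

The samples at timestamps 0 and 1 are discarded because they fell out of the three-sample history before the trigger fired, and the sample at timestamp 8 is not captured because the post-trigger window had already closed. In a real service the `captured` list would be replaced by whatever durable store the design calls for.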