Effective Troubleshooting

Hi everyone. It’s Mark Renoden here again and today I’ll talk about effective troubleshooting. As I visit various customers, I’m frequently asked how to troubleshoot a certain problem, or how to troubleshoot a specific technology. The interesting thing for me is that these are really questions within a question – how do you effectively troubleshoot?

Before I joined Premier Field Engineering, I’d advanced through the ranks of Commercial Technical Support (CTS). Early on, my ability to help customers relied entirely on having seen the issue before or on my knowledge base search skills. Over time, I became more familiar with the technologies and could feel my way through an issue. These days I’m more consciously competent and have a much better understanding of how to work on an issue – the specifics of the problem are less important. The realisation is that troubleshooting is a skill in its own right, and one more general than any single technology, platform or industry.

I’d like to draw your attention to an excellent book on the topic –

Debugging by David J. Agans
Publisher: Amacom (September 12, 2006)
ISBN-10: 0814474578
ISBN-13: 978-0814474570

In his book, Agans discusses what he refers to as “… the 9 indispensable rules …” for isolating problems. I’ll be referring to these rules in the context of being an IT Professional.

Understand the System – Debugging, Chapter 3, pg 11

In order to isolate a problem, Agans discusses the need to understand the system you’re working with. Consider the following.

Purpose – What is the system designed to do and does this match your expectation? It’s surprising how often an issue has its roots in misunderstanding the capabilities of a technology.

Configuration – How was the system deployed and does that match intentions? Do you have a test environment? If so, you can compare “good” with “bad”, or even reproduce the issue and have a safe place to experiment with solutions.

Interdependencies – This is an important thing to understand. Take the example of DFSR, which depends on network connectivity and ports, name resolution, the file system and Active Directory. Problems with any of these components can surface as symptoms in DFSR. Understanding the interplay between these “blocks” and what each “block” is responsible for will greatly assist you in isolating problems.

Tools – It could be argued that tools aren’t part of the system but without knowing how to interrogate each component, you’re unlikely to get very far. Log files, event logs, command line utilities and management UIs all tell you something about configuration and behaviour. Further to this, you need to know how to read and interpret the output. Your tools might include log processing scripts or even something as obscure as an Excel pivot table.
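
As a rough illustration, a log processing script doesn’t need to be sophisticated to be useful. The sketch below (Python, with a hypothetical log path and severity keywords) simply tallies lines by severity so you can see at a glance where to dig deeper.

    # Minimal log processing sketch - the path and severity keywords are
    # hypothetical examples; adjust them to match the log you're reading.
    from collections import Counter

    LOG_PATH = r"C:\Temp\service.log"          # hypothetical example log
    SEVERITIES = ("ERROR", "WARNING", "INFO")  # hypothetical keywords

    counts = Counter()
    with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
        for line in log:
            for severity in SEVERITIES:
                if severity in line:
                    counts[severity] += 1
                    break

    for severity in SEVERITIES:
        print(f"{severity}: {counts[severity]}")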

If you don't know how the system works, look it up. Seek out every piece of documentation you can find and read it. Build a test environment and experiment with configuration. Understand what “normal” looks like.

Check the Plug – Debugging, Chapter 9, pg 107

Start at the beginning and question your assumptions. Don't rule out the obvious – check the basics first. More than a few issues have dragged on far too long because something simple was overlooked in the early stages of the investigation. Can the servers ping each other? Does name resolution work? Does the disk have free space?
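
To make the basics concrete, here’s a minimal Python sketch of those first checks: name resolution, ping and free disk space. The server names are hypothetical placeholders; in practice you’d use whatever tooling your environment already has.

    # "Check the plug" basics - the server names below are hypothetical.
    import platform
    import shutil
    import socket
    import subprocess

    servers = ["server1.contoso.com", "server2.contoso.com"]  # hypothetical names

    for name in servers:
        # Does name resolution work?
        try:
            address = socket.gethostbyname(name)
            print(f"{name} resolves to {address}")
        except socket.gaierror as err:
            print(f"{name} does not resolve: {err}")
            continue

        # Can we reach it? (ping uses -n on Windows, -c elsewhere)
        count_flag = "-n" if platform.system() == "Windows" else "-c"
        result = subprocess.run(["ping", count_flag, "1", name], capture_output=True)
        print(f"ping {name}: {'ok' if result.returncode == 0 else 'failed'}")

    # Does the disk have free space?
    usage = shutil.disk_usage(".")
    print(f"free space on current volume: {usage.free // (1024 ** 3)} GiB")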

Do your tools do what you think they do? If you have doubts, it’s time to review your understanding of the system.

Are you misinterpreting data? Try not to jump to conclusions and try to verify your results with another tool. If you hear yourself saying, “I think this data is telling me …” find a way to test your theory.

Divide and Conquer – Debugging, Chapter 6, pg 67

Rather than trying to look at everything in detail, narrow the scope. Divide the system into pieces and verify the behaviour in each area before you get too deep.

  • Does the problem occur for everybody or just a few users?
  • Is every client PC affected, or only those in one site?
  • What’s common when the problem occurs?
  • What’s different when the problem is absent?

When you’ve isolated the problem to a specific component or set of components, your knowledge of the system and the tools you can use to gather detail come into play.
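
One way to answer the “what’s common” and “what’s different” questions is to capture the same set of facts from a working client and a failing client and compare them. A minimal sketch, assuming you’ve already collected those facts into dictionaries (the settings and values below are made up for illustration):

    # Diff two snapshots of settings from a "good" and a "bad" client.
    # The keys and values are hypothetical placeholders.
    good = {"site": "Sydney", "dns_server": "10.0.0.1", "agent_version": "5.2"}
    bad = {"site": "Melbourne", "dns_server": "10.0.9.9", "agent_version": "5.2"}

    for key in sorted(set(good) | set(bad)):
        if good.get(key) != bad.get(key):
            print(f"{key}: good={good.get(key)!r} bad={bad.get(key)!r}")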

Given a known input, what’s the expected output for each dependent component?

A great suggestion discussed by Agans is to start with the symptoms and work back towards the problem. Each time a component checks out, rule it out and move on to the next. This approach is particularly useful when there are multiple problems contributing to the symptoms. Address each one as you find it and test for success.

Make it Fail – Debugging, Chapter 4, pg 25

Understanding the conditions that reproduce the problem is an essential step in troubleshooting. When you can reliably reproduce the symptoms, you can concisely log the failure or focus your analysis on a specific window in time. A network capture that begins immediately before you trigger a failure and ends immediately after is a great deal easier to work with than one containing a million frames of network activity in which perhaps twenty are useful to your diagnosis.

Another essential concept covered by Agans is that being able to reproduce an issue on demand provides a sure-fire test to confirm a resolution, and that this is difficult if the problem is intermittent. Intermittent problems are just problems that aren’t well understood: if they only occur sometimes, you don’t yet understand all of the conditions that make them occur. Gather as many logs as you can, compare failures with successes and look for trends.
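
A repro harness can be as simple as a loop that runs the suspect operation and records a timestamped outcome for every attempt. The timestamps let you line up failures against network captures and event logs, and the pattern of successes and failures helps with intermittent issues. Here’s a minimal sketch in which reproduce_issue() is a hypothetical stand-in for whatever you’re testing:

    # Repro harness sketch: reproduce_issue() is a hypothetical placeholder
    # for the operation under test; replace it with your own repro steps.
    import datetime

    def reproduce_issue() -> bool:
        """Return True on success, False on failure."""
        return True  # placeholder

    with open("repro_log.txt", "a", encoding="utf-8") as log:
        for attempt in range(100):
            started = datetime.datetime.now().isoformat(timespec="seconds")
            ok = reproduce_issue()
            log.write(f"{started}\tattempt {attempt}\t{'success' if ok else 'FAILURE'}\n")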

Quit Thinking and Look – Debugging, Chapter 5, pg 45

Perception and experience are not root cause – they only guide your investigation. It’s essential that you look for information and evidence. As an example, I recently worked on a DFSR issue in which a huge backlog was being generated. After talking with the customer, we had our suspicions about root cause but as it turned out, a thorough investigation that combined the use of DFSR debug logs and Process Monitor revealed there were two root causes, neither of which had anything to do with our original ideas.

Only make a change when it’s simpler than collecting evidence, when it won’t cause any damage and when it’s reversible.

Consider the data gathering points in the system and which tools or instrumentation expose behaviour, but take care that using those tools or turning on instrumentation doesn’t alter the system’s behaviour. Time-sensitive issues are one example where monitoring may hide the symptoms.

Don’t jump to conclusions. Prove your theories.

Change One Thing at a Time – Debugging, Chapter 7, pg 83

Earlier I suggested having a test environment so you could compare “good” with “bad”. Such an environment also allows you to narrow your options for change and to understand possible causes for a problem.

Whether you’re able to refine your list of possibilities or not, it’s important to be systematic when making changes in the system. Make one change at a time and review the behaviour. If the change has no effect, reverse it before moving on.

Another consideration is whether the system ever worked as expected. If it did, and you have even a rough idea of when it last worked, change records may help you identify the root cause.

Keep an Audit Trail – Debugging, Chapter 8, pg 97

Don’t rely on your memory. You’re busy – you’ll forget. Keep track of what you’ve done, in which order and how it affected the system. Detail is important, especially when you’re handing the issue over to a colleague. During my time in CTS, we’d pass cases between each other all the time, sometimes without a face-to-face handover. Good, detailed case notes were important to a smooth transition.
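
Your audit trail doesn’t need to be fancy; a text file with timestamped entries is enough. A bare-bones sketch (the file name and the example entries are just illustrations):

    # Append timestamped notes to a plain-text troubleshooting log.
    # The file name and the example entries below are illustrative only.
    import datetime

    def record(note: str, path: str = "troubleshooting_log.txt") -> None:
        timestamp = datetime.datetime.now().isoformat(timespec="seconds")
        with open(path, "a", encoding="utf-8") as log:
            log.write(f"{timestamp}  {note}\n")

    record("Disabled real-time antivirus scanning on SERVER01 - no change, re-enabled")
    record("Increased DFSR staging quota - backlog began to fall")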

Get a Fresh View – Debugging, Chapter 10, pg 115

Talk the problem through with a colleague. I’ve had many experiences where I’ve realised how to tackle a problem by just talking about it with another engineer. The act of explaining the facts and clarifying the problem so that someone else could understand it gave me the insight needed to take the next step.

Don’t cloud their view with your own interpretation of the symptoms. Explain the facts and give your colleague a chance to draw their own conclusions.

Don’t be embarrassed or too proud to ask for help. Be eager to learn from others – the experience of others is a great learning tool.

If You Didn’t Fix It, It Ain’t Fixed – Debugging, Chapter 11, pg 125

Check that it’s really fixed and try to “make it fail” after you’ve deployed your solution. Reverse the fix and check that it’s broken again. Problems in IT don’t resolve themselves – if symptoms cease and you don’t know why, you’ve missed key details.

- Mark “cut it out with a scalpel” Renoden