There is a scene in Jurassic Park where Wayne Knight’s character, in a race to steal the dinosaur embryos by making a mad dash to the boat in a hurricane, drives the company car down an embankment and gets stuck in the mud. In his efforts to free himself he comes face to face with a small dino… and assumes it to be cute and benign. His bad. But I can understand how he could make such a mistake. As a visitor to a strange island which is also full of hundreds of plants and animals you’ve never seen before… I would probably make some tactical errors too.
There are some days where troubleshooting is a lot like being on that prehistoric island.
I mentioned in my first blog post that one of the most important things to do is to define the problem well. We begin by asking open-ended questions: “Describe the island we’re on.” Then we narrow it down: “How many dinosaurs do you see?” Finally we get to the straight yes/no questions: “Is the dinosaur biting you right now?”
But on this island there are a lot of distractors. Not all dinosaurs are Raptors or a big bad T-Rex. Some are dumb. Some are just noisy. Some are workhorses that plow through the day, caring only for themselves. Some like to hog all the resources they can get. The same thought process can be mapped to our computer jungle. No server is ever *just* running Windows. The programs it runs may be carefully written yet hoard resources, or could be a crude implementation that just steps over anything in its path. The point is, when you are new to the server environment, usually there is a lot going on, it’s all strange and you’ve got to know what you’re there to do, else you will end up running through the jungle screaming like a little girl.
So we gather around and define exactly what we need. “Staying alive” is implied. “Fixing the server” is implied. It’s also vague. “Be able to run Application A without the server becoming unresponsive” is better. “Be able to run Application A without depleting all non-paged pool memory” is much better but not always possible. On this new island we don’t yet have all the facts on how things work. We don’t know how smart those Raptors are – yet. So with a slightly vague goal we start our mission to “Fix the server” by first gathering information to better understand our place in the food chain.
Not everyone has that seasoned guide to lead them through that jungle and sort out the issue at hand. What we do have is a multitude of tools. Machete? Check. Flashlight? Check. Radio? Check.
The danger comes from bringing along too many tools and gathering too much data. How do you sift through it all?
In my last blog I mentioned a few common Microsoft tools for gathering data. Performance Monitor can, depending on how it is configured, grab a ton of data. The same goes for MSDT/MPS reports and memory dumps. Add on Process Monitor, Process Explorer and a ton of other tools (from Microsoft and 3rd parties) and you very quickly end up with information overload. We can gather all the data we want but at some point it just becomes noise… and a lot of *Large* files to review. Sometimes it cannot be helped. For example: if you can’t log on to the server remotely, it could be network issues, Active Directory, resource depletion, etc. In such cases we need to gather as much data as possible when the issue occurs. Same goes for issues that only pop up once every few weeks. That’s when I’ll throw as many data-gathering options at it as possible.
Now we get this massive pile of data and start digging in. How do you start to break that down? This brings us to the second thing we should always have: knowledge of our limits. (Think… Raptors testing the fences. Know the boundaries!) What are the system requirements for Windows 2003 or 2008? How is memory handled differently on 32-bit versus 64-bit? What variables do we have set? For example: memory pool issues. Windows 2003 32-bit using the /3GB switch is going to have 128 MB of NonPaged pool for the system to play in. Without the /3GB switch the limit is 256 MB. On an x64 box it’s 75% of physical memory (RAM). If you see a value of 105 in a report – what does that tell you? Is that a danger sign or normal behavior? Knowing the limits of the system you are working on is key to identifying whether the dinosaur looking you in the face is a veggiesaurus or a meatosaurus. Numbers are meaningless without the proper context.
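To make the "numbers need context" point concrete, here is a rough Python sketch that turns a raw NonPaged pool reading into a percentage of the ceiling for that box. It only encodes the three limits quoted above. The function names are made up for illustration, and it assumes the mystery 105 is in megabytes, which the report may or may not mean.

```python
# Rough sketch: put a NonPaged pool reading in context.
# The limits below are the ones quoted in the post; the function names are
# hypothetical, and treating the reading as megabytes is an assumption.

def nonpaged_pool_limit_mb(is_64bit, uses_3gb_switch=False, physical_ram_mb=None):
    """Return the NonPaged pool ceiling in MB for the given configuration."""
    if is_64bit:
        if physical_ram_mb is None:
            raise ValueError("physical_ram_mb is required for a 64-bit system")
        # x64: 75% of physical memory
        return 0.75 * physical_ram_mb
    # 32-bit Windows 2003: 128 MB with /3GB, 256 MB without
    return 128 if uses_3gb_switch else 256


def pool_usage_percent(used_mb, limit_mb):
    """Express a pool reading as a percentage of its ceiling."""
    return 100.0 * used_mb / limit_mb


if __name__ == "__main__":
    reading_mb = 105  # assumed to be megabytes
    configs = [
        ("32-bit with /3GB", nonpaged_pool_limit_mb(False, uses_3gb_switch=True)),
        ("32-bit without /3GB", nonpaged_pool_limit_mb(False)),
        ("x64 with 8 GB RAM", nonpaged_pool_limit_mb(True, physical_ram_mb=8192)),
    ]
    for label, limit in configs:
        pct = pool_usage_percent(reading_mb, limit)
        print(f"{label}: {reading_mb} MB of {limit:.0f} MB ({pct:.1f}%)")
```

Same 105, three very different animals: roughly 82% of the ceiling on a /3GB box, about 41% without the switch, and under 2% on the x64 box with 8 GB.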
Finally, the last step in conquering the island and its hostile occupants is to connect all the dots and know how it *should* work. (On a side note: the assumption that the Raptors couldn’t open door handles was a safe bet…until proved wrong. Even in the land of zeros and ones you can get unexpected treats so I never say I’m 100% sure…because I never know when that door handle is gonna turn…)
Knowing how things should work is another can of worms altogether. The majority of it is learned from observation and testing. MSDN and TechNet are great resources. Knowing a bit of coding and assembly doesn’t hurt but it’s not required. I came across an example of this recently when dealing with Cached Bytes on a Windows 2008 R2 server. The concern was that the server was running out of available physical memory. I gathered a ton of data, read up on memory management and dynamic cache and stared at it for a few days trying to find out what was chewing up the memory on the server. It wasn’t until I stepped back and talked out how dynamic cache should work that I realized the behavior on the server was exactly how it was supposed to work. Memory usage increased over time, specifically in Cached Bytes, until that value hit 90% of physical memory, at which point it was released. By Design. There was no performance hit on the server; everything ran fine. I had been caught up in the fact that the server was using a lot of memory and adopted the mindset that using memory was a bad thing. A server using memory, as it turns out, is not a T-Rex waiting for lunch.