Crazy as it sounds, one of the things we often forget when troubleshooting is one of the most basic things we should know: How is it really supposed to work?
I recently worked an issue regarding performance counters where the SQL performance counters were not loading. If I wanted I could have just whipped out this article and been done with it:
But that's just no fun.
Let's assume we know nothing about Performance Counters:
What are they? Counters are just a way to see how well something is performing. Much like trying to figure out what that one guy in the office, who sits in the corner and doesn’t talk to anyone, really does all day. And, whatever it is... does he do it well? Or does he take 50-minute smoke breaks? Does he drink too much of the office coffee? Steal paper from the copier?
Counters are the operating system's way of checking up on certain things, such as OS behavior, application or program behavior. Everything that shows up in Performance Monitor comes from performance counters.
Where can we find them? Information for individual counters can be viewed in the registry under HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\<service>\Performance. Performance Monitor is one tool that displays the data gathered from those counters.
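To make that registry location concrete, here is a minimal Python sketch that builds the path for a given service and, on a Windows machine, lists whatever values sit under it. The function names are my own, and "PerfDisk" is only an example service name; substitute the service you are actually investigating. The read uses the standard-library winreg module, which exists only on Windows.

```python
import sys

def performance_key_path(service):
    """Return the path, relative to HKEY_LOCAL_MACHINE, of a service's Performance key."""
    return "System\\CurrentControlSet\\Services\\" + service + "\\Performance"

def read_performance_values(service):
    """Read every value under the service's Performance key (Windows only)."""
    import winreg  # standard library, but only available on Windows
    values = {}
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, performance_key_path(service)) as key:
        value_count = winreg.QueryInfoKey(key)[1]  # (subkeys, values, last_modified)
        for i in range(value_count):
            name, data, _value_type = winreg.EnumValue(key, i)
            values[name] = data
    return values

if __name__ == "__main__":
    service = "PerfDisk"  # example service name only; an assumption for illustration
    print("HKEY_LOCAL_MACHINE\\" + performance_key_path(service))
    if sys.platform == "win32":
        for name, data in read_performance_values(service).items():
            print(f"{name} = {data!r}")
```

If the key is missing for the service you expect, that alone is a useful clue before reaching for any repair tool.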
Side note: Interestingly, the information for performance counters does not actually exist in the registry hives - it's read from other sources and displayed via the registry. So, for example, if you attempt to add the Everyone group with Full Control to the security permissions for performance counter information here: HKEY_LOCAL_MACHINE\Software\Microsoft\Windows NT\CurrentVersion\Perflib\009
You may get an error like:
Unable to save permission changes on 009.
The handle is invalid.
That error occurs because the registry is reading the information from somewhere else and displaying it. As the saying goes... Don't kill the messenger - or, in this case, don't try to make the message more powerful by giving the messenger a sword.
Now that we have some basic information we can dig a bit deeper:
Are there different types of counters?
Windows has a base set of performance counters. Then there are Extensible counters - add-ons to the base. SQL and Exchange are examples of applications that can add Extensible counters. Most counters will reside here:
And you can view information in the registry here:
HKEY_LOCAL_MACHINE\Software\Microsoft\Windows NT\CurrentVersion\Perflib\009
But, just to be tricky, some applications may hold their counter information in other locations.
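For the standard Perflib\009 location, the counter names are exposed as a multi-string value named Counter: a flat list that alternates index, name, index, name. The sketch below (function names are mine) pairs that list up into a lookup table. The sample list is illustrative rather than copied from a real machine, and the read function assumes Windows and that the Counter value is readable through winreg on your build.

```python
def pair_counter_names(multi_sz):
    """Turn ['2', 'System', '4', 'Memory', ...] into {2: 'System', 4: 'Memory', ...}."""
    pairs = {}
    # Walk the flat list two items at a time: index string, then display name.
    for index_text, name in zip(multi_sz[0::2], multi_sz[1::2]):
        if index_text.isdigit():  # skip padding or empty trailing entries
            pairs[int(index_text)] = name
    return pairs

def read_counter_names():
    """Read the Counter multi-string from Perflib\\009 (Windows only)."""
    import winreg  # standard library, but only available on Windows
    path = r"SOFTWARE\Microsoft\Windows NT\CurrentVersion\Perflib\009"
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, path) as key:
        data, _value_type = winreg.QueryValueEx(key, "Counter")
    return pair_counter_names(data)

if __name__ == "__main__":
    # Illustrative sample data, not taken from a real system.
    sample = ["2", "System", "4", "Memory", "6", "% Processor Time", ""]
    for index, name in sorted(pair_counter_names(sample).items()):
        print(index, name)
```

A lookup like this makes it easy to check whether an application's counters ever made it into the list at all.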
First, it is good to know which counters you are working on… Base, Extensible (located in the standard system32 folder) or Extensible counters located elsewhere.
I was working on counters located in different locations.
Second, do a little research. Find what tools are commonly used to troubleshoot counters and reinstall them. Lodctr and exctrlst are two that come to the top of the pile.
Third, test in your own environment. With some generic articles in hand, I tried breaking, then fixing, my counters on a test machine, only to find I couldn’t fix them! After a bit more digging, running Process Monitor, and head banging, I realized I was running the lodctr command from the wrong folder. The counters I wanted to load were not from the INI I was using. Specifically, the stinker I was testing on was the Windows Internal SQL database counters located here:
My error was in assuming the associated ini file was in this folder:
C:\Program Files (x86)\Microsoft SQL Server\MSSQL.1\MSSQL\Binn
(command: lodctr perf-MICROSOFT##SSEEsqlctr.ini)
It turns out the sneaky thing was here:
Once correctly located, the lodctr command worked like a champ and my counters returned as expected - for me. After breaking and fixing my counters at will and repeating the steps several times, I started to get a feel for how it really worked. The question now was - why couldn’t we get it working on the problem server?
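The wrong-folder mistake is easy to guard against. Here is a minimal Python sketch, with hypothetical function names, that checks which candidate folders really contain the INI file and only then runs the plain lodctr command from that folder. The list of candidate folders is something you supply; nothing here knows where your INI lives. Actually running lodctr assumes Windows and an elevated prompt.

```python
import os
import subprocess

def folders_containing(ini_name, candidate_folders):
    """Return the candidate folders that actually hold ini_name."""
    return [folder for folder in candidate_folders
            if os.path.isfile(os.path.join(folder, ini_name))]

def lodctr_command(ini_name):
    """Build the plain 'lodctr <ini file>' command as an argument list."""
    return ["lodctr", ini_name]

def load_counters(ini_name, candidate_folders):
    """Run lodctr from the one folder that holds the INI; refuse to guess otherwise."""
    matches = folders_containing(ini_name, candidate_folders)
    if len(matches) != 1:
        raise ValueError(
            f"expected {ini_name} in exactly one folder, found it in {matches}")
    # cwd makes lodctr run from the folder where the INI lives.
    # Windows only, and it typically needs an elevated prompt.
    return subprocess.run(lodctr_command(ini_name), cwd=matches[0], check=True)
```

Calling folders_containing("perf-MICROSOFT##SSEEsqlctr.ini", [...]) with the folders you suspect would have shown me right away that my first guess held no such file.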
Side note: I always try to reproduce the steps on my own system. In this case I built a VM of Windows 2008 with SQL 2005 Standard. I got a baseline system running, took a snapshot of that VM (in case I really screwed things up) and then went to town. Getting this up and running may seem a bit time consuming, but my level of understanding of the issue improved 90%, and it prevents shotgun troubleshooting on live systems.
Knowing how easily I had screwed up my own resolution (I'm my own worst customer), we went back to the production server and triple-checked that we were in the right folder, typing the right command with the right filename. (In my experience syntax errors have been the root cause of a large number of issues.) We even took a Process Monitor trace of the problem server and my test server. And there was the difference. It was in syntax. Seems we were using Lodctr with the /T: switch on the problem server.
This spins up the thought: How is Lodctr supposed to work and how is Lodctr /T: supposed to work? What is the difference? What is the difference in any command line switch? And how do you find out?
Trick #1: Always try "<command>.exe /?" or, in some cases, /help. You should get output with some description.
Wait… What? What does setting "the performance counter as trusted" really mean?
Trick #2: I like to hit TechNet or MSDN and check there for additional information on the commands.
And if that is not enough information explore the Knowledge Base:
At this point there is a basic explanation out there of the difference between Lodctr and Lodctr /T:. One loads the counters from the *.ini file specified and the other syncs up the size and date from the file to the registry. I imagine there is more information out there and, if not, I could always reverse engineer it a bit further. But in this case we finally got the counters on the problem server working.
In my search for information I typically hit these three areas first:
- Knowledge Base and blogs: Very good for error messages and information on broken behavior
- TechNet: great information for how the product or application was designed to work
- MSDN: for use when you want to find out what's happening at the coding level.
Each of the three parts above (determining how performance counters work, troubleshooting and reinstalling performance counters, and researching the tools we use to troubleshoot the counters) has the same thought process flow:
- Define the problem/know what you're working on
- Search for information related to the problem
- Set it up/Break it/reproduce the issue in a test environment
Sure, you can skip steps 1 and 3 and get the job done… but what's the fun in that?