In OpsMgr 2007, your most common alert is likely not an alert from a technology management pack at all. It is more likely a built-in alert that a script failed, or that WMI could not be accessed. This is because when WMI is broken on a machine, almost EVERYTHING fails to execute properly on that agent.
At a recent health check at a customer site, we found that the top 5 alerts in their environment (by cumulative repeat count) were:
- WMI Probe Module Failed Execution
- Service Check Data Source Module Failed Execution
- Backward Compatibility Script Error
- Script or Executable Failed to run
- Service Check Probe Module Failed Execution
Sometimes these alerts are normal: the server is busy, or someone rebooted it without putting it into maintenance mode and allowing the workflows to unload gracefully.
However, a high repeat count on these alerts typically indicates that something is seriously broken on the affected agent(s). Most of the time, the failure is in WMI. Many customers get frustrated with these script errors and see them as “false alerts”, because they don't know how to resolve the root cause: we just tell you “this action broke”, we don't tell you why. It is critical that you examine these alerts anyway, because they point to something seriously wrong with an agent, such as broken WMI, a cscript problem, or an OS issue. If you ignore or disable them, you will never know that monitoring is not functioning 100%.
Generally, here is how I attack script/WMI failures:
1. If the repeat count is 0 or 1, I ignore these as random failures, and close the alerts from time to time.
2. If the repeat count is very high, then something is wrong with the agent, and it needs remediation on the agent OS. Investigate the OpsMgr event log on the agent for Warning/Error events, to see if a lot of workflows are failing due to this issue.
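The triage rule above can be sketched in a few lines of Python. This is only an illustration: the alert records, the field names (`agent`, `name`, `repeat_count`) and the cutoff of 50 are hypothetical, since “very high” is a judgment call and not a fixed number.

```python
from collections import defaultdict

# Hypothetical cutoff for a "very high" repeat count; tune it for your environment.
HIGH_REPEAT_THRESHOLD = 50


def triage_alerts(alerts, high_threshold=HIGH_REPEAT_THRESHOLD):
    """Split alerts into ones to close as noise and agents to investigate.

    `alerts` is an iterable of dicts with the assumed keys
    'agent', 'name' and 'repeat_count'.
    """
    ignorable = []
    investigate = defaultdict(list)
    for alert in alerts:
        if alert["repeat_count"] <= 1:
            # Repeat count of 0 or 1: treat as a random failure.
            ignorable.append(alert)
        elif alert["repeat_count"] >= high_threshold:
            # Persistent failure: the agent OS probably needs remediation.
            investigate[alert["agent"]].append(alert["name"])
        # Anything in between is left alone until it settles one way or the other.
    return ignorable, dict(investigate)


# Made-up sample data, for illustration only.
sample = [
    {"agent": "DC01", "name": "WMI Probe Module Failed Execution", "repeat_count": 412},
    {"agent": "DC01", "name": "Script or Executable Failed to run", "repeat_count": 97},
    {"agent": "WEB02", "name": "Backward Compatibility Script Error", "repeat_count": 1},
    {"agent": "SQL03", "name": "Service Check Probe Module Failed Execution", "repeat_count": 7},
]

ignorable, investigate = triage_alerts(sample)
for agent, names in sorted(investigate.items()):
    print(agent, "->", ", ".join(names))
```

Grouping by agent is deliberate: several different alert names with high repeat counts on the same agent usually share one root cause.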
The FIRST thing I do is see if WMI is responsive. I run WBEMTEST and connect to “root\cimv2”. I then hit “query” and execute a “select * from win32_operatingsystem” to see if it returns results or an error. Next, I look at the namespace from the alert in SCOM; perhaps it is “root\MicrosoftDNS” or “root\CCM”. I connect to that namespace and run the query that is failing in the alert.
If EITHER of the above connections/queries fails, then I know what's wrong: WMI has a core issue, and I punt this to my platform or application team to fix it. Sometimes it needs a MOF recompile; sometimes it needs the WMI service or the OS bounced.
If these all appear to work correctly, or the problem is resolved after a WMI service bounce but re-appears later, check out the following.
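If you would rather script this check than click through WBEMTEST, the same query can be run from the command line. The sketch below is not part of the original procedure: it assumes PowerShell is installed on the agent (it is a separate download on Windows Server 2003) and uses the `Get-WmiObject` cmdlet with its `-Namespace` and `-Query` parameters. The helper names `build_wmi_check` and `wmi_is_responsive` are made up for this example.

```python
import subprocess


def build_wmi_check(namespace, query):
    """Return the argv for a PowerShell one-liner that runs a WQL query."""
    # Inside a PowerShell single-quoted string, a single quote is escaped by doubling it.
    ns = namespace.replace("'", "''")
    q = query.replace("'", "''")
    command = f"Get-WmiObject -Namespace '{ns}' -Query '{q}' -ErrorAction Stop"
    return ["powershell.exe", "-NoProfile", "-Command", command]


def wmi_is_responsive(namespace, query, timeout=60):
    """Run the check on the local Windows machine; True if the query succeeds.

    Assumption: powershell.exe exits with a nonzero code when the last
    command fails, which -ErrorAction Stop is meant to force here.
    """
    try:
        result = subprocess.run(
            build_wmi_check(namespace, query),
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # PowerShell is missing, or WMI hung past the timeout.
        return False
    return result.returncode == 0


# Show the command that would be run for the basic root\cimv2 test.
argv = build_wmi_check("root\\cimv2", "select * from win32_operatingsystem")
print(" ".join(argv))
```

On the agent itself you would call `wmi_is_responsive(...)` first for “root\cimv2”, then for the namespace and query taken from the alert. The timeout matters, because a broken WMI often hangs instead of returning an error.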
There are many things you can do to resolve/remediate these issues. Here is a list of the most common fixes:
1. Apply http://support.microsoft.com/kb/933061 to resolve a LOT of WMI issues on the Windows Server 2003 OS. This should be one of your first steps. It applies to x86 or x64 Windows Server 2003 SP1 or SP2.
2. Registry modification for WMI buffer thresholds (see below)
Set “HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\WBEM\CIMOM\Low Threshold On Events (B)” to 35000000 (default is 10000000)
Set “HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\WBEM\CIMOM\High Threshold On Events (B)” to 70000000 (default is 20000000)
The registry modification to the WMI buffers increases the number of objects that WMI can hold before it starts injecting sleep delays into the WMI service.
3. Apply http://support.microsoft.com/kb/955360 to update Windows Script Host (cscript) to version 5.7. This resolves script timeouts, scripts (VBScripts in particular) consuming a LOT of CPU during execution, and problems with multiple scripts running at the same time. It applies to x86 or x64 Windows Server 2003 SP1 or SP2, and is a very good hotfix for DNS servers, DHCP servers, and Domain Controllers.
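To see which agents still need the registry change from step 2, the two threshold values can be compared against the recommended numbers. The sketch below is read-only and hedged: the function names are invented for this example, and because I am not asserting here whether the values are stored as numbers or strings, it converts whatever the registry returns with `int()`. The `winreg` module is part of the Python standard library on Windows only, so it is imported inside the function that needs it.

```python
CIMOM_KEY = r"SOFTWARE\Microsoft\WBEM\CIMOM"

# Recommended values from step 2 above.
RECOMMENDED = {
    "Low Threshold On Events (B)": 35000000,
    "High Threshold On Events (B)": 70000000,
}


def find_low_thresholds(current, recommended=RECOMMENDED):
    """Return {value name: (current, recommended)} for values that are
    missing or below the recommendation."""
    needs_change = {}
    for name, target in recommended.items():
        value = current.get(name)
        if value is None or int(value) < target:
            needs_change[name] = (value, target)
    return needs_change


def read_cimom_values(names=RECOMMENDED):
    """Read the current values from the local registry (Windows only)."""
    import winreg  # Windows-only standard library module

    values = {}
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, CIMOM_KEY) as key:
        for name in names:
            try:
                values[name], _value_type = winreg.QueryValueEx(key, name)
            except FileNotFoundError:
                values[name] = None
    return values


# Example with the default values quoted above; on an agent you would
# pass read_cimom_values() instead of this hard-coded dict.
defaults = {
    "Low Threshold On Events (B)": 10000000,
    "High Threshold On Events (B)": 20000000,
}
for name, (have, want) in find_low_thresholds(defaults).items():
    print(f"{name}: {have} -> {want}")
```

The check only reports; making the change itself is left to whatever tool you normally use for registry edits.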
Making these three modifications should resolve the majority of systemic issues out there, unless WMI is completely corrupt/unresponsive and needs repair. Sometimes, rebooting a server, or bouncing WMI will temporarily resolve these issues as well, if you cannot apply the fixes immediately.
If you have applied all three of the fixes above and are still experiencing a systemic repeat of a WMI query/script failure, the next step is to run the failing query directly against its namespace in WBEMTEST. I’d like to hear about any experiences here.