The new DNS Management Pack for System Center Operations Manager

The most recent release of the DNS management pack for SCOM is a total re-write of the management pack. 

https://www.microsoft.com/en-us/download/details.aspx?id=37141

During the design, development, and testing process we worked with the Windows DNS feature team, the System Center team, MVPs, and perhaps most importantly real System Admins running real-world DNS servers in production environments.  Here are some of the common complaints we heard from real-world DNS System Admins:

  1. I don't understand why I got this ticket.  SCOM says DNS is broken, but I don't understand how it made that decision.
  2. Why did I get this ticket?  I thought we disabled that rule in SCOM.  (We would later learn that we had disabled it for Windows 2003 DNS but since the rules for Windows 2008 DNS were in a different management pack, the override did not apply.)
  3. Why does SCOM say every in-addr.arpa forwarder we have is broken?
  4. Why didn't SCOM alert on this failing DNS server?  (A problem with WMI had silently prevented discovery from working.)

All of these complaints get back to trust.  The System Admins did not trust that SCOM would give them tickets with enough information to allow them to solve a problem.  And they did not trust that a working SCOM agent would tell them when there really was a problem.  We needed to re-design the management pack to make it very transparent and regain the trust of the people receiving the tickets.

We have been testing and improving the new DNS management pack in our Production environment for a couple of months now.  The System Admins are generally much happier with their tickets.  They usually understand exactly what triggered the alert.  And if they don't, the analysts in the Operations Console can usually clearly explain it to them by glancing at Health Explorer.  I also used SSRS to make a "dashboard" of sorts that allows them to see the status of every IP address targeted by every forwarder.  This has allowed them to clean up our forwarders and solve some minor networking issues that didn't warrant a high priority ticket.  In general the new MP allows System Admins to much more clearly understand what the MP is telling them.  They also know that any monitoring changes they request can be applied to all versions of DNS with a single override.  (Although the MP does provide version-specific groups if we ever need them.)

Here are some of the more significant changes in the new management pack:

Discovery

A chicken-and-egg problem with WMI

In 2009 Kevin Holman observed that the DNS Management Pack behaved unpredictably on Windows Server 2003.  He posted a work-around that would keep the DNS WMI provider loaded.

https://blogs.technet.com/b/kevinholman/archive/2009/06/29/errors-alerts-from-the-dns-mp-script-failures-wmi-probe.aspx

But this raised a larger concern.  The entire DNS management pack has a deep dependency on WMI to function properly.  If WMI isn't working correctly, almost all of the discoveries, all of the tasks, and many of the monitors may not work.

To break this chicken-and-egg cycle, the new management pack uses a simple registry-based discovery of HKLM\SYSTEM\CurrentControlSet\Services\DNS\Start to detect the DNS Server Computer Role.  This role has a monitor on it that serves two functions.  First, it will raise an alert if the root/MicrosoftDNS WMI provider is not working properly.  Second, it runs frequently enough that it will serve as the keep-alive on Windows 2003 DNS Servers.
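As a rough illustration of why a registry check sidesteps the WMI dependency, here is a minimal sketch in Python (chosen for readability; it is not the management pack's own code).  The function names and the injected reader are hypothetical, and treating the mere presence of the Start value as "role installed" is an assumption.

```python
from typing import Callable, Optional

# Registry path named in the post, relative to HKEY_LOCAL_MACHINE.
DNS_SERVICE_KEY = r"SYSTEM\CurrentControlSet\Services\DNS"


def read_start_from_registry() -> Optional[int]:
    """Return the DNS service Start value, or None if the key or value is absent.

    Windows only: winreg is imported lazily so the rest of the sketch
    can be exercised on other platforms.
    """
    import winreg

    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, DNS_SERVICE_KEY) as key:
            value, _value_type = winreg.QueryValueEx(key, "Start")
            return int(value)
    except FileNotFoundError:
        return None


def has_dns_server_role(read_start: Callable[[], Optional[int]]) -> bool:
    """Assumption: the role counts as discovered whenever the Start value exists."""
    return read_start() is not None
```

On a Windows server this would be called as has_dns_server_role(read_start_from_registry).  No WMI call is involved, so a broken WMI provider cannot hide the server from discovery.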

One management pack for Windows 2003, 2008, 2008 R2, and 2012 DNS

The instrumentation of DNS has changed very little between Windows 2003 and Windows 2012.  The events, performance counters, and WMI providers are essentially the same.  So the new management pack is the SCOM management pack for Windows 2012 by default.  It can also optionally replace the existing management pack for Windows 2003, 2008, and 2008 R2 (https://www.microsoft.com/en-us/download/details.aspx?id=12973).  To prevent customers with investments in the existing management pack from being forced into an upgrade, the discovery rule that discovers Windows 2003, 2008, and 2008 R2 DNS servers is disabled by default.  An optional management pack is provided with a single override which enables this discovery.

So if a customer wishes to only use the new management pack to monitor Windows 2012 DNS servers, the optional management pack need not be loaded.  Or if a customer wishes to use the new management pack to monitor older versions of DNS, that feature can be easily enabled.

Rules

Most of the monitors in the old DNS MP that had an event-based reset have been converted back to rules since the events used to reset the monitors were not always logged when a failure was corrected.

The performance and event collection rules are all disabled by default.  They can be enabled in bulk by using the optional override management packs, or selectively as customers desire.

Monitors

Added support to the NSLookup data source for PTR and CNAME records.

in-addr.arpa forwarders are now exercised with a PTR query rather than an A query.
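The choice of record type can be sketched as follows.  This is an illustrative Python example, not the management pack's data source; the function names are hypothetical, and the only nslookup option assumed is the standard -type switch.

```python
from typing import List


def query_type_for_zone(zone: str) -> str:
    """Pick PTR for reverse-lookup (in-addr.arpa) zones and A for everything else."""
    normalized = zone.rstrip(".").lower()
    is_reverse = normalized == "in-addr.arpa" or normalized.endswith(".in-addr.arpa")
    return "PTR" if is_reverse else "A"


def build_nslookup_command(name: str, zone: str, server_ip: str) -> List[str]:
    """Build an nslookup argument list that queries `name` against `server_ip`."""
    return ["nslookup", f"-type={query_type_for_zone(zone)}", name, server_ip]
```

An A query against a reverse zone will generally not return a useful answer, which is consistent with the complaint that every in-addr.arpa forwarder appeared broken.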

Monitors that use NSLookup now return the command line called, STDOUT, and STDERR in the property bag to make interpreting results in Health Explorer easier.
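The idea of returning the raw evidence alongside the result can be sketched like this.  It is a Python illustration under stated assumptions: the dictionary stands in for a SCOM property bag, the key names are hypothetical, and a harmless stand-in command is used so the example runs on any machine (a real monitor would run nslookup).

```python
import subprocess
import sys
from typing import Dict, List

# Stand-in command so the sketch runs anywhere; not what the monitors execute.
STAND_IN_COMMAND = [sys.executable, "-c", "print('ok')"]


def run_and_capture(command: List[str], timeout_seconds: float = 30.0) -> Dict[str, str]:
    """Run a command and return what was attempted plus the raw results."""
    completed = subprocess.run(
        command,
        capture_output=True,
        text=True,
        timeout=timeout_seconds,
    )
    return {
        "CommandLine": " ".join(command),  # simple join, no shell quoting
        "StdOut": completed.stdout,
        "StdErr": completed.stderr,
        "ExitCode": str(completed.returncode),
    }
```

With all four values available in one place, an analyst can re-run the exact command by hand and compare the output with what the monitor saw.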

The NSLookup data source no longer disregards some parameters based on other parameters.  It takes only the parameters that it needs, and uses them all.

Forwarders that target multiple IP addresses now test each IP address individually, but only alert if all of them are failing.
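The "alert only if all targets fail" rule amounts to a simple aggregation over per-IP results.  A minimal Python sketch follows; the function name and the shape of the input are hypothetical, not taken from the management pack.

```python
from typing import Dict


def forwarder_is_healthy(results_by_ip: Dict[str, bool]) -> bool:
    """Return True if at least one target IP answered.

    Each target is tested individually and recorded as True (answered) or
    False (failed).  The forwarder is unhealthy only when every target failed.
    An empty mapping also yields False, since nothing answered.
    """
    return any(results_by_ip.values())
```

For example, forwarder_is_healthy({"10.0.0.1": False, "10.0.0.2": True}) is True, so no alert would be raised, while the per-IP results remain available for reporting on the individual failing address.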

General

Script Simplification

The old DNS management pack used a mix of VBScript and JScript.  The new management pack uses only VBScript.

Most of the scripts are significantly shorter, allowing for easier debugging.  Functions that were contained in the scripts but never called have been removed.

Cosmetic Changes

Object display strings now more closely match their system names.  This should make diagnostics easier for those customers that read the SCOM database directly.

All monitors, rules, and discoveries have their enabled properties set to either True or False, because the "onEssentialMonitoring" value could be confusing to some customers.

Alert generating rules have their category set to "Alert."

All rule-based alerts now contain the event ID, event source, and event log that generated the alert in the description.  This makes tickets cut from the alert description much more self-explanatory.

Included properties in most script-based monitors to clearly show in Health Explorer what was attempted, and what the raw results were.