***Note: This post is edited to reflect the newly shipped MP version 6.0.6958.0
Get it from the download center here: http://www.microsoft.com/download/en/details.aspx?id=9296
This really looks like a nice addition to the Base OS MP’s. This update centers around a few key areas for Windows 2008 and 2008 R2:
- Adds Cluster Shared Volume discovery and monitoring for free space and availability. This is critical for those Hyper-V clusters on Server 2008 R2.
- Adds a new monitor to execute the Windows Best Practices Analyzer for different discovered installed Roles, and then generate alerts until these are resolved.
- Changes to many built in rules/monitors, to reduce noise, database space and I/O, and increase a positive “out of the box” experience. Also added a few new monitors and rules.
- Changes to the MP Views – removing some old stuff and adding some new
- Addition of some new reports – way cool
Let’s take a look at these changes in detail:
Cluster Shared Volume discovery and monitoring:
We added a new discovery and class for cluster shared volumes:
We added some new monitors for this new class:
NTFS State Monitor and State monitor are disabled by default. The guide states:
- This monitor is disabled as normally the state of the NTFS partition is not needed (Dirty State notification).
- This monitor is disabled because, when enabled, it may cause false negatives during backups of the Cluster Shared Volumes
I’d probably leave these turned off.
The free space monitoring for CSV’s is different than how we monitor Logical disks. This is good – because CSV’s are hosted by the cluster virtual resource name, not by the Node, as logical disks are handled. What CSV’s have is two monitors, which both run a script every 15 minutes, and compare against specific thresholds. Free space % is 5 (critical) and 10 (warning) while Free space MB is 100 (critical) and 500 (warning) by default. Obviously you will need to adjust these to what’s actionable in your Hyper-V cluster environment.
BOTH of these unit monitors act and alert independently, as seen in the above graphic for state, and below graphic for alerts:
Some notes on how free space monitoring of CSV’s works:
- Each unit monitor has state (critical or warning) and generates individual alerts (warning ONLY)
- There is an aggregate rollup monitor (Cluster Share Volume – Free Space Rollup Monitor) that will roll up WORST STATE of any member, and ALSO generate alerts, when the WORST state rolls up CRITICAL. This is how we can generate warning alerts to notify administrators, but then also generate a new, different CRITICAL alert for when error thresholds are breached. I really like this new design better than the Logical Disk monitoring…. it gives the most flexibility to be able to generate warning and critical alerts when necessary. Perhaps you only email notify the warning alerts, but need to auto-create incidents on the critical. The only downside is that if a CSV volume fills up and breaches all thresholds in a short time frame, you will potentially get three alerts.
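To make the alerting behavior concrete, here is a minimal Python sketch of the logic described above. It is not the MP's actual script: the function and class names are hypothetical, the "at or below the threshold" comparison is an assumption, and reading "warning ONLY" as "unit monitor alerts always carry Warning severity" is my interpretation. Only the default thresholds (5/10 percent, 100/500 MB) come from the MP.

```python
from enum import IntEnum


class State(IntEnum):
    HEALTHY = 0
    WARNING = 1
    CRITICAL = 2


def unit_state(free_value, warning, critical):
    """Less free space is worse. Assumes 'at or below' counts as a breach."""
    if free_value <= critical:
        return State.CRITICAL
    if free_value <= warning:
        return State.WARNING
    return State.HEALTHY


def evaluate_csv(free_pct, free_mb,
                 pct_thresholds=(10, 5), mb_thresholds=(500, 100)):
    """Return (rollup_state, alerts) for one Cluster Shared Volume.

    Thresholds are (warning, critical) pairs using the MP defaults.
    """
    pct_state = unit_state(free_pct, *pct_thresholds)
    mb_state = unit_state(free_mb, *mb_thresholds)

    alerts = []
    # Each unit monitor alerts on its own, at Warning severity only.
    for name, state in (("Free Space %", pct_state),
                        ("Free Space MB", mb_state)):
        if state != State.HEALTHY:
            alerts.append((name, "Warning"))

    # The rollup takes the worst member state and raises its own,
    # separate Critical alert when that worst state is CRITICAL.
    rollup = max(pct_state, mb_state)
    if rollup == State.CRITICAL:
        alerts.append(("Free Space Rollup", "Critical"))
    return rollup, alerts
```

Calling `evaluate_csv(3, 50)` breaches every threshold at once and returns three alerts (two Warning, one Critical), which is the "three alerts" downside mentioned above. Calling `evaluate_csv(8, 2000)` returns a single Warning alert from the percentage monitor.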
There are also collection rules for the CSV performance:
Best Practices Analyzer monitor:
A new monitor was added to run the Best Practices Analyzer. You can read more about the BPA here:
This monitor is shipped DISABLED out of the box to reduce noise; however, you can enable it if you would like to create alerts when your Server 2008 R2 computers are not following best practice configurations.
We can open Health Explorer and get detailed information on what’s not up to snuff:
Alternatively – we can run this task on demand to ensure we have resolved the issues:
Changes to built in Monitors and Rules:
Many rules and monitors were changed from a default setting, to provide a better out of the box experience. You might want to look at any overrides you have against these and give them a fresh look:
- “Logical Disk Availability Monitor” renamed to “File System error or corruption”
- I wrote more about this monitor here: http://blogs.technet.com/b/kevinholman/archive/2010/07/29/logical-disk-availability-is-critical-what-does-this-mean.aspx
- “Avg Disk Seconds per Write/Read/Transfer” monitors changed from Average Threshold monitortype to Consecutive Samples Threshold monitortype.
- This is VERY good – this stops all the noise for the default enabled Sec/Transfer monitor, caused by momentary perf spikes.
- The default threshold is set to “0.04” which is 40ms latency. This is a good generic rule of thumb for the typical server.
- The default sample rate is once per minute, for 15 consecutive samples.
- Note – make sure you implement or at least evaluate hotfixes 2470949 or 2495300 for 2008R2 and 2008 Operating systems, which affect these disk counters.
- Make sure you look at any overrides you had previously set on these – as they likely should be reviewed to see if they are still needed.
- Disabled “Percentage Committed Memory in Use” monitor
- This monitor used to change state when more than 80% of memory was utilized. This created unnecessary noise due to the fact that more and more server roles utilize all available memory (SQL, Exchange) and this monitor was not always actionable.
- Disabled “Total Percentage Interrupt Time” and “Total DPC Time Percentage”.
- These monitors would often generate alert and state noise in heavily virtualized environments, especially when the CPU’s are oversubscribed or heavily consumed temporarily. These were turned off by default, because there are better performance counters at the Hypervisor host level to track this condition than these OS level counters.
- Added “Free System Page Table Entries” and “Memory Pages per Second” monitors. These are both enabled out of the box to track excessive paging conditions. Also added MANY perf collection rules targeting memory counters, some disabled by default, some enabled.
- “Total CPU Utilization Percentage” monitor was increased from 3 to 5 samples. The timeout was shortened from 120 to 100 seconds (to be less than the interval of 120 seconds).
- Disabled the following perf counter collection rules by default:
- Avg Disk Sec/Write
- Avg Disk Sec/Read
- Disk Writes Per Second
- Disk Reads Per Second
- Disk Bytes Per Second
- Disk Read Bytes Per Second
- Disk Write Bytes Per Second
- Average Disk Read Queue Length
- Average Disk Write Queue Length
- Average Disk Queue length
- Logical Disk Split I/O per second
- Memory Commit Limit
- Memory Committed Bytes
- Memory % Committed Bytes in use
- Memory Page Reads per Second
- Memory Page writes per second
- Page File % use
- Pages Input per second
- Pages output per second
- System Cache Resident Bytes
- System Context Switches per second
- Enabled the following perf counter collection rules by default:
- Memory Pool Paged Bytes
- Memory Pool Non-Paged bytes
- The Windows Computer discovery added a “ProductType <> WinNT” to further filter out incorrect discoveries.
- The Windows Disk partition discovery changed a propertyname from “Bootable” to “BootPartition” to fix an old issue.
- Added a new Monitortype for NetworkAdapter.PercentBandwidthUsed
- This also added a new DataSource – which runs a script to collect %Utilization of a NIC – almost identical to what I wrote about previously here: http://blogs.technet.com/b/kevinholman/archive/2011/03/02/how-to-collect-performance-data-from-a-script-example-network-adapter-utilization.aspx
- There is a new Rule and Monitor which use this datasource (and they probably cook down correctly) to collect/inspect/monitor this every 5 minutes. This negates the need to create a custom one like my example above.
- “Available Megabytes of Memory” monitor script was updated. The default value for threshold was changed from “2.5” to “100”.
- Minor update to the Logical disk defrag monitor
- Modified the tolerances and ToleranceTypes of several optimized performance collection rules.
A full list of all disabled rules, monitors and discoveries is available in the guide in the Appendix section. The disabling of all these logical disk and memory perf collections is AWESOME. This MP really collected more perf data than most customers were ready to consume and report on. By including these collection rules, but disabling them, we are saving LOTS of space in the databases, valuable transactions per second in SQL, network bandwidth, etc… etc.. Good move. If a customer desires them – they are already built and a quick override to enable them is all that’s necessary. Great work here. I’d like to see us do more of this out of the box from a perf collection perspective.
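To illustrate why the move to a Consecutive Samples Threshold monitortype quiets the disk latency monitors, here is a small Python sketch comparing the two approaches. It is a simplified model, not the MP's implementation: the function names are hypothetical, and the averaging window is assumed to be the same 15 samples. The 0.04 threshold and the 15 consecutive samples are the defaults noted above.

```python
def average_breach(samples, threshold=0.04):
    """Average-style check: breach if the mean of the window exceeds the threshold."""
    return sum(samples) / len(samples) > threshold


def consecutive_breach(samples, threshold=0.04, required=15):
    """Consecutive-samples check: breach only if the most recent
    `required` samples in a row all exceed the threshold."""
    streak = 0
    for value in samples:
        # Any sample at or under the threshold resets the streak.
        streak = streak + 1 if value > threshold else 0
    return streak >= required


# One momentary spike (2 seconds) in an otherwise quiet 15-minute window.
spike = [0.01] * 14 + [2.0]

# Sustained latency of 50 ms for 15 samples in a row.
sustained = [0.05] * 15

print(average_breach(spike), consecutive_breach(spike))
print(average_breach(sustained), consecutive_breach(sustained))
```

In this model the single spike drags the average over 0.04 and trips the average-style check, while the consecutive-samples check ignores it and only fires on the sustained case.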
Changes to MP views:
The old on the left – new on the right:
Top level logical disk and network adapter state views removed.
Added new views for Cluster Shared Volume Health, and Cluster Shared Volume Disk Capacity.
New Reports! Performance by system, and Performance by utilization:
There are two new reports deployed with this new set of MP’s (provided you import the new reports MP that ships with this download – only available from the MSI and not the catalog)
***Note: These two new reports are shipped in their own new MP: the Microsoft.Windows.Server.Reports.mp. These reports are supported only when your SQL servers supporting the OpsMgr backend are SQL 2008 or later. They will not deploy on SQL 2005.
To run the Performance by System report – open the report, select the time range you’d like to examine data for, and click “Add Object”. This report has already been filtered to return only Windows Computer objects. Search based on computer name, and add in the computer objects that you’d like to report on. On the right – you can pick and choose the performance objects you care about for these systems. We can even show you if the performance value is causing an unhealthy state – such as my Avg % memory used – which is yellow in the example:
Additionally – there is a report for showing you which computers are using the most, or the least resources in your environment. Open “Performance by Utilization”, select a time range, choose a group that contains Windows Computers, and choose “Most”. Run that, and you get a nice dashboard – with health indicators – of which computers are consuming the most resources, and potentially also impacted by this:
Using the report below – I can see I have some memory issues impacting my Exchange server, and my Domain Controller is experiencing disk latency issues.
By clicking the DC01 computer link in the above report – it takes me to the “Performance by System” report for that specific computer – very cool!
In summary – the Base OS MP is already a rock solid management pack. This made some key changes to make the MP even less noisy out of the box, and added critical support for discovering and monitoring Cluster Shared Volumes.
Known Issues in this MP:
1. A note on upgrading these MP’s – I do not recommend using the OpsMgr console to show “Updates available for Installed Management Packs”. The reason for this, is that the new MP’s shipping with this update (for CSV’s and BPA) are shipped as new, independent MP’s…. and will not show up as needing an update. If you use the console to install the updated MP’s – you will miss these new ones. This is why I NEVER recommend using the Console/Catalog to download or update MP’s…. it is a worst practice in my personal opinion. You should always download the MSI from the web catalog at http://systemcenter.pinpoint.microsoft.com and extract them – otherwise you will likely end up missing MP’s you need.
2. The “Available Megabytes of Memory” monitor script was updated in this version. Along with this update, the default threshold was changed from “2.5” to “100”. In the current monitor, the “100” reflects “MBytes”. This value is a good indication of memory pressure; however, depending on your environment, it might create a lot of alerts that are not actionable. You should review any previous overrides you have set on this monitor, and adjust the default setting as necessary.
3. The “Logical Disk Free Space” monitors were completely re-written. The datasource and monitortype was changed from a script that runs once per hour and drives monitor state, to a new script that runs once every 15 minutes, and drives monitor state after 4 consecutive samples. That seems like a good design change to control any noise from fluctuating disks. However, running the script every 15 minutes might increase the performance impact with more scripts per hour executing on your agents. The script datasource no longer outputs the %Free and MBFree values in the propertybag, therefore – these had to be removed from the Alert Description and Health Explorer. The monitor still works as designed – it creates an alert whenever the threshold is breached. The only change exposed to the end user – is that these values for actual free space in MB and % are not going to be exposed to the alert notification recipient.
4. When you try and run the report “Performance By Utilization” you get an error:
An error has occurred during Report Processing.
Query execution failed for dataset ‘PerfDS’.
Procedure or function Microsoft_SystemCenter_Report_Performance_By_Utilization has too many arguments specified.
On a reporting server without remote errors enabled – you might only see the top two lines in the error above. I recommend enabling remote errors on your reporting server so the report output will show you the full details of the error: How to Enable Remote errors on SQL reporting server
If you are getting the “too many arguments specified” error, this is caused by the Windows 2003 MP. It also contains the stored procedure definition for Microsoft_SystemCenter_Report_Performance_By_Utilization, however the definition in the Windows 2003 MP is missing the “@DataAggregation INT,” variable. Depending on the MP import process, it is possible that the stored procedure from the Microsoft.Windows.Server.Reports.mp, which does contain this variable, will not be deployed. In order to resolve this issue – we need to modify the existing stored procedure, and add the “@DataAggregation INT,” line just below the “Alter procedure” line. Ensure you back up your Data Warehouse database FIRST, and if you are not comfortable editing stored procedures, open a case with Microsoft on this issue. An alternative is to use the SCOM Authoring console, open the Microsoft.Windows.Server.Reports.mp file, go to reporting, Data Warehouse Scripts, Microsoft.Windows.Server.Reports.PerformancebyUtilization.Script properties, Install tab, and copy the actual script. You can run this script in a SQL query window targeting your DW database, and it will create/modify your sproc.
The above instructions ONLY cover the SPECIFIC “Too many arguments” error. If you are getting ANY OTHER error, the above method will not resolve your issue and you should open a case for resolution.