QuickTip: Disabling workflows to optimize for large environments


 


 

One of the coolest things about SCOM is how much monitoring you get out of the box.

That said, one of the biggest performance impacts on SCOM is all of that out-of-the-box monitoring, plus every Management Pack you import.  The effect is cumulative, and over time it can slow the console, simply because of all the activity happening in the management group.

I have long stated that the biggest performance relief you can give SCOM is to reduce the number of workflows, reduce the classes and relationships, and keep things simple.

SCOM 2007 shipped back in March 2007.  In the 10 years since, we have continuously added management packs to a default installation of SCOM, and continuously added workflows to the existing MP’s.

For the most part, this is good.  These packs add more and more monitoring and capabilities “out of the box”.  However, in many cases they also add load to the environment:  they discover class instances and relationships, add state calculation, and so on.  In small SCOM environments (under 1000 agents) this will have very little impact, but at large enterprise scale, every little thing counts.

 

I have already written about some of the optional things you can consider (IF you don’t use the features), such as removing the APM MP’s, and removing the Advisor MP’s.

 

Here is one I came across today with a customer:

 

I noticed that on the server which hosts the “All Management Servers Resource Pool” workflows, we have some out-of-the-box PowerShell script based rules that were timing out after 300 seconds, and running every 15 minutes:

Collect Agent Health States (ManagementGroupCollectionAgentHealthStatesRule)

Collect Management Group Active Alerts Count (ManagementGroupCollectionAlertsCountRule)



 

These scripts call cmdlets like “Get-SCOMAgent” and “Get-SCOMAlert”.  They were timing out:  running constantly for 5 minutes, getting killed by the timeout limit, then starting over again.  This kind of thing has a significant impact on SQL blocking, SDK utilization, and overall performance.

 

Now, in small environments this isn’t a big deal, and these rules return results quickly with little impact.  However, in a VERY large environment, Get-SCOMAgent can take 10 minutes or more just to return the data!  If you have hundreds of thousands of open alerts, the alert SDK queries can take just as long.
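
If you want to see just how expensive these calls are in your own environment, here is a minimal sketch (run from a management server with the OperationsManager module available) that simply times the same two cmdlets:

# Time the SDK queries these rules depend on
Import-Module OperationsManager

# Seconds to enumerate every agent in the management group
(Measure-Command { $agents = Get-SCOMAgent }).TotalSeconds

# Seconds to retrieve the open (resolution state 0) alerts
(Measure-Command { $alerts = Get-SCOMAlert -ResolutionState 0 }).TotalSeconds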

The only thing these two rules are used for is to populate a SCOM Health dashboard, and they provide little value.

 


 

I recommend that larger environments disable these two rules, as they are very resource intensive for very minimal value.  If you would like to keep them, then override the interval to 86400 seconds (once per day), set a sync time so they run off peak, such as 23:00 (11pm), and set the timeout to 600 seconds.  If a rule cannot complete in 10 minutes, disable it.  Also, stagger the sync time for the second rule to begin at 23:20 (11:20pm) so they aren’t both running at the same time.
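
If you want to confirm the rules (and which management pack they live in) from PowerShell before creating the overrides, here is a minimal sketch using the display names listed above:

# List the two collection rules by display name
Import-Module OperationsManager
Get-SCOMRule -DisplayName 'Collect Agent Health States', 'Collect Management Group Active Alerts Count' |
    Select-Object DisplayName, Name, Enabled, Target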

 


 

Additionally, in this same MP (Microsoft.SystemCenter.OperationsManager.SummaryDashboard) there are two discoveries:

Collect Agent Versions (ManagementGroupDiscoveryAgentVersions)

Collect agent configurations (ManagementGroupDiscoveryAgentConfiguration)

These discoveries run once per hour, and also call things like Get-SCOMAgent, which is bad for large environments, especially at that frequency.

The only thing they do is populate a dashboard showing agent versions and configurations.

 


 

I rarely see this dashboard being used, and I recommend that large environments disable these discoveries as well.
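
If you prefer PowerShell, here is a minimal sketch that disables both discoveries with an override.  It assumes you store overrides in an unsealed management pack; the display name 'SCOM Core Overrides' below is only a placeholder, so substitute your own override MP:

# Disable the two dashboard discoveries via overrides saved to an unsealed MP.
# 'SCOM Core Overrides' is a placeholder display name - use your own override MP.
Import-Module OperationsManager
$overrideMp = Get-SCOMManagementPack -DisplayName 'SCOM Core Overrides'
Get-SCOMDiscovery -DisplayName 'Collect Agent Versions', 'Collect agent configurations' |
    Disable-SCOMDiscovery -ManagementPack $overrideMp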


 

Speed up that SCOM deployment!



Comments (11)

  1. M.Mathew says:

    Another great Post!! Thx for sharing.

  2. Cdufour says:

    Thx for tips!

  3. Bill Carlson says:

    Thanks as always for your tips Kevin. I was wondering why my ‘Active Alerts’ count was showing so low (vastly incorrect). The rule indeed uses Get-SCOMAlert, but specifies -ResolutionState 0. For those of us with Connectors, an alert doesn’t stay New longer than a minute. To be more accurate, as you said, if someone really wanted to keep using it, they could disable the original and create a duplicate, changing Get-SCOMAlert -ResolutionState 0 to Get-SCOMAlert -Criteria “ResolutionState 255”

    1. Bill Carlson says:

      Correction: It should be Get-SCOMAlert -Criteria “ResolutionState != 255”

    2. Kevin Holman says:

      Hi Bill!

      Yep. This whole MP was just an added-on dashboard to show “cool” stuff, but it is inefficient in how it gets the data, and it is not applicable to all customers because of resolution states, just as you pointed out. It would be better to consider “active alerts” as “not equal 255”.
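
      For example, a minimal sketch of that approach (the criteria string below is just an illustration; adjust the resolution state logic to match your own connector model):

      # Count 'active' alerts as anything that is not Closed (resolution state 255)
      Import-Module OperationsManager
      (Get-SCOMAlert -Criteria 'ResolutionState != 255').Count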

  4. Tonka Pushchaira says:

    Kevin, you and others at Microsoft use environment size definitions of small, large, very large, etc.
    What is the ballpark of what you (as an individual, or MS as a group) consider each of these definitions to mean?

    1. Kevin Holman says:

      Ugh. I don’t think there really is a “standard” and I am not sure everyone would agree with what I think.

      A SCOM Management group can range in agent count from 0 to 15,000 agents. We know that for anything over 6,000 agents we reduce the number of allowed console sessions to 25, just because of the performance impact there.

      In my personal experience, it is usually around the 1,000 agent mark that I start to see performance become a huge factor.
      Consoles run slow, SDK responses slow down, disk latency and CPU become super important on SQL and management servers, and so on. The number and complexity of the management packs imported start to take a much bigger toll once the agent count grows.

      By comparison – I have seen 5,000 agent management groups SCREAM…. because the customer was only monitoring about 30 custom critical rules and monitors, and they did not import all the typical enterprise MP’s you normally see. That environment was super fast and responsive in the consoles.

      So personally, I consider anything less than 500 agents to be “small”, 500 to 2000 agents “large”, and anything over 2000 agents to be “very large”.

      What is interesting is that deployments with 2,000 agents or 10,000 agents often have very similar characteristics, responsiveness, and similar problems and challenges.

      The largest single management group I have ever seen was a little over 17,000 agents. That customer ran everything on VM’s, for SQL and management servers, and while the console responsiveness was far less than ideal, the monitoring environment was stable and very maintainable. I would still advocate for any environment over 2,000 agents to use dedicated physical hardware for the SQL database servers, but we see more and more customers adopt VM’s for everything these days.

      1. Tonka Pushchaira says:

        Thanks Kevin, I appreciate the great response. Our environment is 4000+ and I am currently working on our in place upgrade to 2016, trimming a lot of the fat that has slowed down our DB and DW (which are VMs). These articles come with great value and allow me to focus more on the migrating than the hunting down of troublesome rules and discoveries.

        I apologize for asking an off-topic question, but I’ve tried reaching out to Tao Yang to see if a self-maintenance MP for 2016 was in the pipeline. I’ve not received any response, so I have begun shoehorning in the 2012 version. Do you have any info on this from your side of the table?

      2. rob1974 says:

        As for physical SQL servers, I gave up on requesting those. I tell my customers (always over 500 servers) what the performance should be and let them decide.
        For example, I ask for a maximum of 10 ms disk latency on SQL (and a maximum of 20 ms disk latency on management servers), measured from the Windows OS. The customers who choose VM’s usually suffer from performance issues, but a well thought-out virtualization strategy can provide the required specs.
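
        For example, a minimal sketch to spot-check that latency from the Windows OS (the counters are standard LogicalDisk counters; the 5-second interval and 12 samples are just example values):

        # Sample read/write latency for one minute and report it in milliseconds
        Get-Counter -Counter '\LogicalDisk(*)\Avg. Disk sec/Read', '\LogicalDisk(*)\Avg. Disk sec/Write' -SampleInterval 5 -MaxSamples 12 |
            Select-Object -ExpandProperty CounterSamples |
            Select-Object Path, @{ Name = 'LatencyMs'; Expression = { [math]::Round($_.CookedValue * 1000, 1) } }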

        1. Kevin Holman says:

          Absolutely agreed Rob,

          Nothing at all wrong with using a VM. I explain that as long as we get dedicated CPUs and memory, we aren’t stacked on top of a saturated host, and they can provide robust storage with latency and IOPS guarantees, we won’t have an issue. I explain that sometimes that might mean having a host dedicated to a single VM, or only a couple of VM’s, in order to make that a reality. Customers usually scoff at the idea and want every VM to be a standard commodity, which is often unrealistic.

  5. Thanks for sharing this!
