Understanding SCOM Resource Pools


 


 

 

Resource pools are nothing new – they were introduced in SCOM 2012 RTM, for two reasons:

1.  To remove the single-point-of-failure that was the RMS role in SCOM 2007.

2.  To provide a mechanism for high availability of agentless/remote workflows, such as Unix/Linux, Network, and URL monitoring, among others.

 

That said – they are often not fully understood.

 

Let's talk about the primary components of a Resource Pool.  I am going to “dumb this down” a lot, because it is actually quite complex behind the scenes.  So I will break this down into “roles” with regard to Resource Pools.  The primary “role” components we will discuss are:

1.  Members

2.  Observers

3.  Default Observer

 

Members of a pool are either a Management Server or a Gateway Server. 

Observers are “observer-only” roles.  These are Management Servers or Gateway servers that do NOT participate in loading workflows for the pool, but DO participate in quorum decisions.  You would use these if you wanted high availability for your pool, but wanted only a limited number of Healthservices actually running pool workflows.  Healthservice-based observer-only roles are rarely used under normal circumstances.

Default Observer is the SCOM Operations Database.  It is set to “Enabled” or “Disabled” for every pool.  It is enabled by default for all pools created in the UI, and disabled by default for all pools created via PowerShell using the New-SCOMResourcePool cmdlet.  The “reason” this exists is the following:

To allow a pool to have high availability when you have only two management servers in the pool.

 

Let’s talk about that.

A pool requires ONE or more members.

A pool requires THREE (quorum voting) members to establish high availability.

High availability is the ability to have a member be unavailable, with no loss of monitoring.

 

The reason we need THREE (quorum voting) members (not two) for high availability is because of the quorum algorithm.  We require that MORE than 50% of the quorum voting members in a pool be available.  If you have only two members of a pool, and one is down, you have lost quorum, because of the “greater than 50%” rule.

Therefore – the “Default Observer” was dreamed up, so customers would not HAVE to deploy a minimum of THREE management servers just to get high availability for their Resource Pools.  It is a special quorum voting “observer” role, to allow for high availability of pools when you have two management servers deployed.  This reduced cost and complexity for a basic SCOM deployment.
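The “greater than 50%” rule is easy to sanity-check with a few lines of PowerShell.  This is only a sketch of the arithmetic described above, not how SCOM implements quorum internally; Test-PoolQuorum is a hypothetical helper name I made up for illustration, and it is not a SCOM cmdlet.

```powershell
# Hypothetical helper (NOT a SCOM cmdlet): applies the "more than 50%" rule.
function Test-PoolQuorum {
    param(
        # All quorum voters: members + observers (+ default observer, if enabled)
        [Parameter(Mandatory = $true)] [int] $VotingMembers,
        # Voters that are currently available
        [Parameter(Mandatory = $true)] [int] $AvailableMembers
    )
    # Quorum requires strictly MORE than half of the voters to be available.
    return (($AvailableMembers * 2) -gt $VotingMembers)
}

# Two voters, one down: exactly 50% available, so no quorum.
Test-PoolQuorum -VotingMembers 2 -AvailableMembers 1   # False

# Three voters (for example 2 MS + default observer), one down: quorum holds.
Test-PoolQuorum -VotingMembers 3 -AvailableMembers 2   # True
```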

 

Let's break this into “scenarios”.

 

Single Management server in pool

The default observer is enabled by default.

There is no high availability, because the management server is a single point of failure.

The default observer provides no benefit (nor harm) in this case.

 

Two management servers in pool

The default observer is enabled by default.

There is high availability for the pool, because there are three voting members (2 MS + Default Observer)

If you disable the default observer, you will lose high availability for the pool.

 

Three management servers in pool

The default observer is enabled by default.

There is high availability for the pool, because there are four voting members (3 MS + Default Observer)

You can only have ONE management server down and still maintain the pool (greater than 50% rule): if two MS are down, only 50% of the voting members remain available, so the pool suicides.

The default observer in this case provides NO value.  It does not increase the number of management servers that can be down, therefore it does not increase pool stability.

You can consider removing the DO (Default Observer) in this scenario.

 

Four management servers in pool

The default observer is enabled by default.

There is high availability for the pool, because there are five voting members (4 MS + Default Observer)

You can have up to TWO management servers down and still maintain the pool (greater than 50% rule): if three MS are down, only 2 of the 5 voting members (40%) remain available, so the pool suicides.

The default observer in this case provides significant value, because it increases the number of management servers that can be down.  Without the DO in this case, you’d only have 4 quorum members, which only allows for ONE to be unavailable.

 

Five or more management servers in pool

The default observer is enabled by default.

There is high availability for the pool, because there are 6 voting members (5 MS + Default Observer)

You can have up to TWO management servers down and still maintain the pool (greater than 50% rule): if three MS are down, exactly 50% of the voting members remain available, so the pool suicides.

The default observer in this case provides NO value.  It does not increase the number of management servers that can be down, therefore it does not increase pool stability.

You can consider removing the DO (Default Observer) in this scenario.

 

One could argue – that once you have 3 or more management servers in a pool, any “odd” number of management servers would be a good consideration to remove the DO from the pool.  I’d also argue that once you hit 5 management servers, you are probably big enough that the database is under significant load (you wouldn’t typically have 5 management servers in a small environment).  When the database is under heavy load, the default observer might not perform well, and might experience latency in resource pool calculations/voting.

The way the default observer plays a role – is that each MANAGEMENT SERVER in the pool, queries its own local SDK service – which allows it to get data from the database.  There is a table in the SCOM Operations database for the default observer.  So if the SDK service is under load, or the database, we could experience latency that otherwise would not exist.
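If you want to see the management server scenarios above side by side, the following sketch prints how many management servers can be down with and without the default observer.  It assumes the default observer itself stays available and only models the voting arithmetic from this article; Get-ToleratedFailures is a hypothetical helper, not a SCOM cmdlet, and the [pscustomobject] syntax needs PowerShell 3.0 or later.

```powershell
# Hypothetical helper (NOT a SCOM cmdlet): largest number of voters that can be
# unavailable while MORE than 50% of the voters remain available.
function Get-ToleratedFailures {
    param([Parameter(Mandatory = $true)] [int] $VotingMembers)
    return [int][math]::Floor([double]($VotingMembers - 1) / 2)
}

# Compare pools of 1 to 6 management servers, with and without the default observer.
# Assumption: the default observer (the database) is always available,
# so every failure counted here is a management server.
foreach ($ms in 1..6) {
    [pscustomobject]@{
        ManagementServers = $ms
        MSDownWithDO      = Get-ToleratedFailures -VotingMembers ($ms + 1)
        MSDownWithoutDO   = Get-ToleratedFailures -VotingMembers $ms
    }
}
```

The two columns only differ for an even number of management servers, which is the reason the default observer adds nothing for 3 or 5 management servers but helps with 2 or 4.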

 

Gateways as resource pool members

 

Next – we should discuss the Gateway role as it pertains to Resource Pools.  Microsoft supports resource pool membership for Management Servers, AND for Gateway servers. 

For instance, a customer might monitor Unix/Linux servers in a firewalled off DMZ, or across a small WAN circuit where you want the agentless communication localized.  In this scenario, a customer might create dedicated resource pools for Gateways in those locations, to perform monitoring.

 

Single Gateway server in pool

The default observer is enabled by default.

There is no high availability, because the Gateway server is a single point of failure.

The default observer should NOT be used here, because Gateways do not have a local SDK service, therefore they cannot query the database.

 

Two Gateway servers in pool

The default observer is enabled by default.

One would THINK there is high availability for the pool, because there are two GW’s in the pool, right?  HOWEVER – that is NOT the case.  As we discussed above – we need three voting members to establish high availability for a pool.  Since the Default Observer is NEVER valid for a pool consisting of Gateways, there are only TWO members of this pool.  The pool will run, and will load balance workflows, but if either pool member goes down, the pool suicides.  In this case – you actually have WORSE availability than if you placed a single member in the pool!

In order to maintain high availability for a pool made of Gateways, you need to have THREE GW’s in the pool.

The default observer should NOT be used here, because Gateways do not have a local SDK service, therefore they cannot query the database.

 

Three Gateway servers in pool

The default observer is enabled by default.

There is high availability for the pool, because there are three voting members (3 GW)

You can only have ONE Gateway server down and still maintain the pool (greater than 50% rule): if two GW are down, only 1 of the 3 voting members remains available, so the pool suicides.

The default observer should NOT be used here, because Gateways do not have a local SDK service, therefore they cannot query the database.

 

 

Let’s take a minute and process this.

 

What we have learned, is that you should remove the DO from any pool comprised of Gateways.

You should consider removing the DO from pools when 5 or more Management Servers are present.

If your pools are stable….. and you aren’t having any problems with high availability….. then this really doesn’t make much difference….. which is why the defaults are set like they are.
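The two rules of thumb above can be written down as a small decision helper.  This is a minimal sketch that only encodes the recommendations from this article; Get-DefaultObserverAdvice and its parameters are hypothetical names for illustration, and it does not query SCOM at all.  You would feed it the member count and member type yourself, together with the pool's UseDefaultObserver value.

```powershell
# Hypothetical helper (NOT a SCOM cmdlet): encodes the rules of thumb from this article.
function Get-DefaultObserverAdvice {
    param(
        [Parameter(Mandatory = $true)] [int]  $MemberCount,         # number of pool members (MS or GW)
        [Parameter(Mandatory = $true)] [bool] $HasGatewayMembers,   # $true if the pool is made of Gateways
        [Parameter(Mandatory = $true)] [bool] $UseDefaultObserver   # current default observer setting
    )

    if (-not $UseDefaultObserver) {
        return 'Default observer already disabled'
    }
    if ($HasGatewayMembers) {
        # Gateways have no local SDK service, so they cannot query the database.
        return 'Disable the default observer: Gateways cannot query the database'
    }
    if ($MemberCount -ge 5) {
        return 'Consider disabling the default observer: 5 or more Management Servers'
    }
    return 'Leave the default observer enabled'
}

# A pool of three Gateways that still has the default observer enabled:
Get-DefaultObserverAdvice -MemberCount 3 -HasGatewayMembers $true -UseDefaultObserver $true

# A pool of two Management Servers, where the default observer is what provides HA:
Get-DefaultObserverAdvice -MemberCount 2 -HasGatewayMembers $false -UseDefaultObserver $true
```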

 

So we have talked about pool members, and the default observer…… but what about the “observer” role?

This role is really unique, and will not be used very often.  I cannot think of a single enterprise deployment where I have seen it used.  Generally speaking – if we are adding a dedicated observer for a pool (which is a management server or a GW server) then why not just make that server a full blown pool member?

There is only one scenario I can think of where this might be useful: a company with SCOM deployed in a datacenter, and in the SAME DATACENTER a DMZ with two gateways deployed because of firewall rules.  In this case, you could potentially make the gateways' parent management server a dedicated observer only, and this would work because TCP port 5723 is already open for Healthservice communication.  This is incredibly rare, and the best practice would be to just go ahead and plan for three Gateway servers in the DMZ.

 

Remember – for resource pool members – Microsoft supports Management Servers and Gateways.

For resource pool observers – the same, Management Servers and Gateways.

 

That said – I have done some testing making an *agent* a dedicated observer, such as the DMZ scenario above, and it does work.  The agent becomes a voting member for quorum, and high availability is created by this.  Microsoft didn’t plan or test this scenario – so it is technically unsupported.

Which got me to thinking – “what if I create a resource pool, and make its membership strictly agents”???

Well, that works too.  You cannot do this using the UI, but you can in PowerShell.  I created a resource pool of only agents, then set up URL monitoring to that pool, and high availability and load distribution worked great.  Again, not technically supported by Microsoft, but a unique capability nonetheless.

 

Lastly – I will demonstrate some PowerShell commands to work with this stuff.

 

To disable the default observer for a pool:

$pool = Get-SCOMResourcePool -DisplayName "Your Pool Name"
$pool.UseDefaultObserver = $false
$pool.ApplyChanges()

 

To add or remove Management Servers or Gateways from a manual pool:

$pool = Get-SCOMResourcePool -DisplayName "Your Pool Name"
$MS = Get-SCOMManagementServer -Name "YourMSorGW.domain.com"
# To add the server as a pool member:
$pool | Set-SCOMResourcePool -Member $MS -Action "Add"
# To remove the server as a pool member:
$pool | Set-SCOMResourcePool -Member $MS -Action "Remove"

 

To add or remove Management Servers or Gateways as Observers only to a pool:

$pool = Get-SCOMResourcePool -DisplayName "Your Pool Name"
$Observer = Get-SCOMManagementServer -Name "YourMSorGW.domain.com"
# To add the server as an observer:
$pool | Set-SCOMResourcePool -Observer $Observer -Action "Add"
# To remove the server as an observer:
$pool | Set-SCOMResourcePool -Observer $Observer -Action "Remove"

 

If you want to play with adding AGENTS as a resource pool member or observer (not supported) then simply change “Get-SCOMManagementServer” above – to “Get-SCOMAgent”

 

 

Credits:

A debt of gratitude to Mihai Sarbulescu at Microsoft for his guidance on this topic – he has forgotten more about Resource Pools than most people at Microsoft ever knew.


Comments (19)

  1. Tommy says:

    Thank you for the info 🙂 Very useful

  2. M.Mathew says:

    Gr8 Article.!!Thx for the post!

  3. Hi,
    I used the following command to create resource pool in a new SCOM 2016 installation:
    New-SCOMResourcePool -DisplayName "Displayname of the pool" -Member (Get-SCOMManagementServer | ? {expression}) -Description "Description of the pool"
    I checked both of them and the $_.UseDefaultObserver value is “False” by default. I did not change it. Maybe this is true for SCOM 2016 only?
    BTW, this is a good article as we got used to it from Kevin. Thank you for it again.
    Sandor

    1. Kevin Holman says:

      Thanks for the catch. Pools created in powershell are apparently different than pools created in the UI. I will update this.

  4. Ravi says:

    Hi Kevin,

    When you say “Pool Suicides” (when less than 50% of members are available), do you mean that all agents will lose communication to the resource pool and turn into greyed-out agents?

    Ravi

    1. Kevin Holman says:

      No – i don’t mean that at all. Agent communication has NOTHING to do with resource pools.

      NOTHING. Resource pools are for workflows. Agents communicate directly to management servers, and have their own mechanism for failover, which has not changed from the SCOM 2007 design. Customers often get confused and think that resource pools handle agent failover. They do not, and there is no relation.

      When a pool suicides, this means the pool unloads itself from all members, and all workflows that were hosted by the pool are not initialized, and therefore do not run.

      1. a.elfimov says:

        Hi Kevin,
        Do I understand correctly that there is no relationship between “Pool Suicides” and the SCOM alert “The resource pool failed to heartbeat”, and that they are two different problems with two different causes?

        1. Kevin Holman says:

          Those are related. A resource pool failing to heartbeat means the pool isn't healthy and stable. This could be due to pool suicides, database connectivity, database blocking, load, bad workflows, all kinds of reasons. If this is common, you start looking at what you have placed on the pool, and what other events are being logged in the management server OpsMgr event log.

  5. Asger Nissen says:

    Awesome post. Just a quick note on a scenario where we use the Observer role.
    We monitor different SNMP enabled devices (getting traps) as Network Devices in SCOM via a resource pool that consists of two gateway servers. Some of these network devices only allows for two trap destinations. As we want the redundancy, but cannot use more than two servers in the pool (for the reason explained above) we use a SCOM agent as observer for the pool.

    1. Kevin Holman says:

      Asger – THANKS! That is a perfect reason for observers!

      SNMP traps can only be processed for a device by the pool member that currently hosts that specific device. Therefore, when a device hosted by a pool sends SNMP traps, as you have figured out, it must send them to ALL members of the pool in order to ensure the trap will be processed.

      So by only allowing two hosting members of a pool, but adding an observer, you get the high availability without impacting trap reception.

      Excellent feedback!

  6. Kevin, hi.
    As always a great and very helpful post. Thanks to you and Mihai.

    I created two ps1 that might help to show the config and to set the observers accordingly:
    https://gallery.technet.microsoft.com/PoSh-Show-Resource-Pool-40d9b18f
    https://gallery.technet.microsoft.com/PoSh-Set-Resource-Pool-aea4e7be

    Best regards,
    Patrick

  7. JVD says:

    Kevin, I am having an issue with a customer that uses 2 MS + 1 OBS (Failover SQL Cluster DB). Fairly frequently, we are experiencing issues with the resource pools (All Management Servers Resource Pool unavailable), which almost always occur at night.
    As read in your post, you would advise to use an uneven amount of management servers. However this customer has 2 datacenters, ideally I would have an even amount of management servers on both sides to cover the load. Would it be advisable to move the DO to another server in this case?

    1. JVD says:

      Forgot to mention, the DB is under a significant load at night due to backups and maintenance.

      1. Kevin Holman says:

        Are the management servers split across multiple datacenters? In general, we don't recommend or support that configuration, and this is a very common misunderstanding with customers.

        Management servers need to be less than 5ms from each other AND from the databases. In most cases where a customer has multiple datacenters, the network latency between DC’s is more than 5ms at all times, or they cannot guarantee it will remain less than 5ms 24×7, such as during times of high network saturation from backups, etc.

        This will cause resource pool failures.

        If that is your case, you have to consider some design changes, or you have to edit the registry to change the resource pool timeout and failure settings, from my blog article on tweaking management servers for large environments.

        1. JVD says:

          Hey Kevin,

          The management servers reside in the same datacenter, but they do reside in two different physical rooms. So latency is not an issue.
          The problem I seem to be having is that the pool seems to be under heavy load at night due to backups and SQL maintenance, which I assume causes pool instability.
          Would it be better to use an agent as an observer in this case, and remove the DO from the SQL server?

          1. Kevin Holman says:

            If this is happening at night – and you think it is backup related – then it is much more likely that the pool failures are caused by DB connectivity issues, not the DO.

            If the DO fails – this won’t cause pool instability, because the two MS in the pool will work just fine. I’d look at your disk I/O on the SQL server, and for other events in the MS event logs around this time for clues. So to answer your question: no, I would not recommend moving the DO to an agent, and I would recommend leaving the DO in the pool as the database with a size like that. I’d focus on the root cause of pool failure, which is likely SQL connectivity.

          2. JVD says:

            Hey Kevin,

            The system backup of the SQL server seems to be correlating with the pool issues. Have disabled the system backup for now. Thanks for the advice.

  8. Birdal says:

    Hi Kevin,
    I know that my question is not directly related to your article, but it is a design question. I am not so familiar with SCOM. We are planning a completely new monitoring system based on SCOM 2016. Our environment is:

    – 2 Active Directory domains (different forests). There are no trusts between the ADs.
    – Objects: Windows Servers (800), Linux Servers (200), network components (200), some specific applications / services in both Active Directory domains.
    – We have 6 virtual servers for SCOM environment based on VMware.

    I prefer to locate all SCOM servers (Management Servers, console, SQL Servers, Gateway Servers, etc.) in the 1.AD, and open the necessary ports between Gateway Servers and the 2.AD.

    What is the best SCOM servers locations & design for this Environment?
    Which ports should be opened between SCOM Gateway Servers and the Domain Controllers in 2.AD?

    Thanks in advance.
    Birdal
