How to multihome a large number of agents in SCOM


 

Quick download:  https://gallery.technet.microsoft.com/SCOM-MultiHome-management-557aba93

 

I have previously written solutions that include tasks to add and remove management group assignments on SCOM agents:

https://blogs.technet.microsoft.com/kevinholman/2017/05/09/agent-management-pack-making-a-scom-admins-life-a-little-easier/

 

But what if you are doing a side-by-side SCOM migration to a new management group and you have thousands of agents to move?  There are a lot of challenges with that:

 

1.  Moving them manually with a task would be very time-consuming.

2.  Agents that are down or in maintenance mode are not available to multi-home.

3.  If you move all the agents at once, you will overwhelm the destination management group.

 

I have written a Management Pack called “SCOM.MultiHome” that will manage these issues more gracefully.

 

It contains one (disabled) rule, which will multi-home your agents to your intended ManagementGroup and ManagementServer.  This is also override-able, so you can specify different management servers initially if you wish.

 


 

This rule is special in how it runs.  It is configured to check once per day (86400 seconds) to see if it needs to multi-home the agent.  If the agent is already multi-homed, it does nothing.  If it is not multi-homed to the desired management group, it adds the new management group and management server.
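
To make that concrete, here is a minimal sketch of the kind of check-and-add logic the rule's write action script performs, using the agent's AgentConfigManager.MgmtSvcCfg COM object.  The management group name and management server FQDN below are placeholder values, and this is an illustration of the approach rather than the exact script shipped in the MP:

# Placeholder values - replace with your destination management group and server
$NewMGName = "NEWMG"
$NewMSFQDN = "newms01.yourdomain.com"

# Load the agent scripting COM object that ships with the SCOM agent
$AgentCfg = New-Object -ComObject "AgentConfigManager.MgmtSvcCfg"

# Build a reliable list of the management groups this agent already reports to
[array]$MGNames = @()
FOREACH ($MG in $AgentCfg.GetManagementGroups())
{
  $MGNames += $MG.managementGroupName
}

# Only touch the agent if it is not already homed to the new management group
IF ($MGNames -notcontains $NewMGName)
{
  $AgentCfg.AddManagementGroup($NewMGName,$NewMSFQDN,5723)
  $AgentCfg.ReloadConfiguration()
}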

But what is most special is the timing.  Once enabled, the rule uses a scheduler datasource parameter called SpreadInitializationOverInterval.  This is very powerful:

<DataSource ID="Scheduler" TypeID="System!System.Scheduler"> <Scheduler> <SimpleReccuringSchedule> <Interval Unit="Seconds">86400</Interval> <SpreadInitializationOverInterval Unit="Seconds">14400</SpreadInitializationOverInterval> </SimpleReccuringSchedule> <ExcludeDates /> </Scheduler> </DataSource>

 

What this does is run the workflow once per day, but the workflow will not initialize immediately.  It will initialize at a random point within the window provided.  In the example above that window is 14400 seconds, or 4 hours.  This means that if I enable the rule for all agents, they will not all run it immediately; each will pick a random time between NOW and 4 hours from now to run the multi-home script.  This keeps us from overwhelming the new environment with hundreds or thousands of agents all at once.  You can make this window bigger or smaller if you wish by editing the XML here.
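
Conceptually, the spread behaves as if each agent delays its first run by a random offset inside the window and then repeats on the normal interval.  A rough PowerShell illustration of the idea (this is only the effect, not how the agent implements it internally):

# Illustration only: each agent effectively delays its first run by a random
# offset inside the spread window, then repeats on the normal interval.
$IntervalSeconds = 86400   # run once per day
$SpreadSeconds   = 14400   # 4 hour initialization window
$InitialDelay    = Get-Random -Minimum 0 -Maximum $SpreadSeconds
Start-Sleep -Seconds $InitialDelay
# ...run the multi-home script here, then again every $IntervalSeconds...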

 

Next up – the Groups.  This MP contains 8 Groups.

 


Let’s say you have a management group with 4000 agents.  If you multi-homed all of these to a new management group at once, it would overwhelm the new management group and take a very long time to catch up.  You will see terrible SQL blocking on your OpsMgr database and 2115 events about binding on discovery data while this happens. 

The idea is to break up your agents into groups, then override the multi-home rule using these groups in a phased approach.  You can start with 500 agents over a 4-hour period, and see how that works and how long the new management group takes to catch up.  Then add more and more groups until all agents are multi-homed.

These groups will self-populate, dividing your agents up across the groups.  The group discovery queries the SCOM database and assigns each group a numbered range of agents (StartNumber through EndNumber).  By default each group contains 500 agents, but you will need to adjust these ranges for your total agent count.


  <DataSource ID="DS" TypeID="SCOM.MultiHome.SQLBased.Group.Discovery.DataSource">
    <IntervalSeconds>86400</IntervalSeconds>
    <SyncTime>20:00</SyncTime>
    <GroupID>Group1</GroupID>
    <StartNumber>1</StartNumber>
    <EndNumber>500</EndNumber>
         
    <TimeoutSeconds>300</TimeoutSeconds>
  </DataSource>
</Discovery>
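
If you want to preview how your agents would split into these numbered ranges before enabling anything, here is a rough sketch from an operational PowerShell session.  It assumes the OperationsManager module and a placeholder management server name, and the sort order is only illustrative; it may not match the exact ordering used by the MP's SQL-based group discovery:

# Preview how your agents would split into 500-agent ranges (illustrative ordering)
Import-Module OperationsManager
New-SCOMManagementGroupConnection -ComputerName "oldms01.yourdomain.com"

$Agents = Get-SCOMAgent | Sort-Object DisplayName
$GroupSize = 500

FOR ($i = 0; $i -lt $Agents.Count; $i += $GroupSize)
{
  $GroupNumber = ($i / $GroupSize) + 1
  $End = [math]::Min($i + $GroupSize, $Agents.Count) - 1
  Write-Host ("Group{0}: {1} agents ({2} through {3})" -f $GroupNumber, ($End - $i + 1), $Agents[$i].DisplayName, $Agents[$End].DisplayName)
}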

Also note there is a sync time set on each group, staggered about 5 minutes apart.  This keeps all the groups from populating at once.  You will need to set this to your desired time, or wait until the default sync time (20:00 local time in the example above) for the groups to start populating.

 

Wrap up:

Using this MP, we resolve the biggest issues with side-by-side migrations:

 

1.  No manual multi-homing is required.

2.  Agents that are down or in maintenance mode will multi-home gracefully when they come back up.

3.  Using the groups, you can control the load placed on the new management group and test the migration in phases.

4.  Using the groups, you can load balance the destination management group across different management servers easily.


Comments (8)

  1. M.Mathew says:

    Another great article!! Thanks @Kevin!

  2. msviborg says:

    Hi Kevin
    Another great article making life as a SCOM admin a lot easier – THANK YOU!
    We have regional domains and dedicated Management Servers per region. Would it be possible to control that via this script?
    I’m thinking like controlling which group the servers are added to, via a suffix or something, and then making sure the server in that group is added to a specific Management Server?

    Thanks in advance
    Michael

    1. Kevin Holman says:

      Yes, sure you could do this…. for the initial multi-home to a second management group.

      Why do you have regional management servers? Do you mean multiple management servers, in the same SCOM management group, but in different locations? If that’s the case, that’s a really bad design. Management servers should all be in the same physical location/network and so should the SCOM DB’s. Gateways can be location dependent if really required.

  3. stemo76 says:

    Thanks again for the great contribution. I ran this over the weekend with pretty good success. We recently installed an 1801 SAC instance and have started the migration.

    I did the groups a bit differently. Created a new property using HKLM\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\Agent Management Groups\PBCACPT\Parent Health Services\0\AuthenticationName

    Now I can build groups based on the current primary management server and build a mapping to the new management servers in the new instance.

    For some reason I did get a lot of these errors but when running the commands locally on the server it worked, and the applet was updated. I haven’t found the cause.

    Event Description: SCOM.MultiHome.AddMG.Rule.WA.ps1 :
    FATAL ERROR: No management groups were found on this agent, which means a scripting error.
    Terminating script.

    1. stemo76 says:

      I see the reason I am getting those errors but I don’t know why the count value isn’t being populated. Checked another server with the same PS version and it works fine.

      Copyright (C) 2009 Microsoft Corporation. All rights reserved.

      PS C:\Windows\system32> # Load Agent Scripting Module
      PS C:\Windows\system32> $AgentCfg = New-Object -ComObject "AgentConfigManager.MgmtSvcCfg"
      PS C:\Windows\system32> $MGs = $AgentCfg.GetManagementGroups()
      PS C:\Windows\system32> $MGsCount = $MGs.Count
      PS C:\Windows\system32> $mgscount
      PS C:\Windows\system32> $mgs.count
      PS C:\Windows\system32> $mgs

      managementGroupName : InstanceName
      ManagementServer : Server.corp.Company.org
      managementServerPort : 5723
      IsManagementGroupFromActiveDirectory : False
      ActionAccount : Local System

      1. Kevin Holman says:

        I have seen this before – it has to do with the object type for "$MGs = $AgentCfg.GetManagementGroups()"

        What I did was loop through each one and build a strongly typed [array]$whatever to count on. I just tested this on one of my WS2012R2 agents and it has the same issue.

        FOREACH ($MG in $MGs)
        {
        $MGCount++
        }

        That will give you a reliable count I think…..

        1. stemo76 says:

          I made that change in the script and will see if that stops those alerts and updates the last few servers that have to migrate. I tested on one of the boxes that would consistently fail and the FOR loop worked.

          Also for some reason that .count function would often cause the PowerShell window to crash

        2. stemo76 says:

          Made the change yesterday and it appears to be working good. A couple servers got the config change and no new errors.
