Exchange 2010 / 2013 – PAM and the Cluster Core Resources


At any given time, in every database availability group (DAG), there is one member that is responsible for the coordination of database actions across the DAG. This member is known as the Primary Active Manager (PAM). The current PAM can be determined by using Get-DatabaseAvailabilityGroup -Status, as shown below.

 

[PS] C:\>Get-DatabaseAvailabilityGroup -Status -Identity DAG | fl name,primaryActiveManager

Name                 : DAG
PrimaryActiveManager : MBX-1

 

The mailbox server that is the PAM is always the current owner of the Cluster Core Resource group.

 

[PS] C:\>Get-ClusterGroup -Cluster MBX-1

Name                                    OwnerNode                               State
----                                    ---------                               -----
Available Storage                       MBX-3                                   Offline
Cluster Group                           MBX-1                                   Online

 

The cluster group may contain several cluster resources. The PAM does not depend on the state of any of the resources in this group, and the PAM role will always be assigned to the node that owns the Cluster Group.
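This relationship can be checked directly. The following is a minimal read-only sketch (assuming a DAG named DAG, the Exchange Management Shell, and the FailoverClusters module) that compares the PAM reported by Exchange with the current Cluster Group owner:

```powershell
# Compare the PAM reported by Exchange with the Cluster Group owner.
# Assumes a DAG named "DAG"; run from the Exchange Management Shell
# on a machine with the FailoverClusters module installed.
Import-Module FailoverClusters

$pam   = (Get-DatabaseAvailabilityGroup -Identity DAG -Status).PrimaryActiveManager
$owner = (Get-ClusterGroup -Cluster DAG -Name "Cluster Group").OwnerNode

if ("$pam" -eq "$owner") {
    Write-Host "PAM ($pam) matches the Cluster Group owner."
} else {
    Write-Warning "PAM ($pam) does not match the Cluster Group owner ($owner)."
}
```

The two values should always agree; a mismatch that persists would itself be worth investigating with Microsoft support rather than correcting by hand.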

 

The Cluster Group can be moved between members using the cluster management tools. 

 

Windows Server 2008 R2 / Windows Server 2012 / Windows Server 2012 R2:

Command Line:

Cluster.exe cluster <DAGNAME> group "Cluster Group" /moveto:<NODE>

 

PowerShell:

Move-ClusterGroup -Cluster <DAGNAME> -Name "Cluster Group" -Node <NODE>

*Note:  In Windows Server 2012 and Windows Server 2012 R2, PowerShell is the preferred way to manage clusters; Cluster.exe is deprecated and must be installed as an optional feature.

 

Each DAG member that does not own the cluster group is a Standby Active Manager (SAM).  When the cluster group is moved between nodes, a notification process detects that the Cluster Group owner has changed. This triggers detection logic to determine the new PAM.  In this example, the cluster group is moved from MBX-1 to MBX-2.

 

PS C:\> Move-ClusterGroup -Name "Cluster Group" -Node MBX-2

Name                                    OwnerNode                               State
----                                    ---------                               -----
Cluster Group                           MBX-2                                   Online

 

A review of the Microsoft-Exchange-HighAvailability/Operational crimson channel on MBX-2 shows the promotion of MBX-2 from SAM to PAM:

 

Log Name:      Microsoft-Exchange-HighAvailability/Operational
Source:        Microsoft-Exchange-HighAvailability
Date:          8/3/2014 8:37:47 AM
Event ID:      227
Task Category: Role Monitoring
Level:         Information
Keywords:     
User:          SYSTEM
Computer:      MBX-2.domain.com
Description:
Active manager configuration change detected. (PreviousRole='SAM', CurrentRole='PAM', ChangeFlags='Role, CurrentPAM', LastError='<none>')

Log Name:      Microsoft-Exchange-HighAvailability/Operational
Source:        Microsoft-Exchange-HighAvailability
Date:          8/3/2014 8:37:48 AM
Event ID:      111
Task Category: Role Monitoring
Level:         Information
Keywords:     
User:          SYSTEM
Computer:      MBX-2.domain.com
Description:
Active manager role changed from 'SAM' to 'PAM'

 

A review of the Microsoft-Exchange-HighAvailability/Operational crimson channel on MBX-1 shows the demotion of MBX-1 from PAM to SAM:

 

Log Name:      Microsoft-Exchange-HighAvailability/Operational
Source:        Microsoft-Exchange-HighAvailability
Date:          8/3/2014 8:37:47 AM
Event ID:      227
Task Category: Role Monitoring
Level:         Information
Keywords:     
User:          SYSTEM
Computer:      MBX-1.domain.com
Description:
Active manager configuration change detected. (PreviousRole='PAM', CurrentRole='SAM', ChangeFlags='Role, CurrentPAM', LastError='<none>')

Log Name:      Microsoft-Exchange-HighAvailability/Operational
Source:        Microsoft-Exchange-HighAvailability
Date:          8/3/2014 8:37:47 AM
Event ID:      111
Task Category: Role Monitoring
Level:         Information
Keywords:     
User:          SYSTEM
Computer:      MBX-1.domain.com
Description:
Active manager role changed from 'PAM' to 'SAM'

 

Other servers within the DAG also acknowledge the move but continue to maintain their existing roles:

 

Log Name:      Microsoft-Exchange-HighAvailability/Operational
Source:        Microsoft-Exchange-HighAvailability
Date:          8/3/2014 8:37:47 AM
Event ID:      227
Task Category: Role Monitoring
Level:         Information
Keywords:     
User:          SYSTEM
Computer:      MBX-3.domain.com
Description:
Active manager configuration change detected. (PreviousRole='SAM', CurrentRole='SAM', ChangeFlags='CurrentPAM', LastError='<none>')
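The role-change events shown above can be pulled from the crimson channel with Get-WinEvent. A minimal sketch, filtering on the two event IDs (227 and 111) seen in these entries:

```powershell
# Retrieve recent Active Manager role-change events (IDs 227 and 111)
# from the HighAvailability crimson channel on the local server.
Get-WinEvent -FilterHashtable @{
    LogName = 'Microsoft-Exchange-HighAvailability/Operational'
    Id      = 227, 111
} -MaxEvents 20 |
    Sort-Object TimeCreated |
    Format-Table TimeCreated, Id, Message -AutoSize -Wrap
```

Run this on each DAG member (or add -ComputerName) to see the promotion, demotion, and acknowledgement events side by side.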

 

The Cluster service is also responsible for automatic arbitration of this group.  Automatic arbitration may occur for a number of reasons, including:

 

  • The failure of a member
  • The failure of a resource contained within the Cluster group

 

In most cases, Exchange administrators should not be concerned with the owner of the cluster group or the node designated as the PAM.  This is true even for DAGs that span multiple sites where the PAM may be a node in a distant datacenter.

 

In recent weeks I have fielded several questions from administrators who are concerned with which member holds the PAM role. Questions like:

 

  • Should I set a preferred owner on the cluster group so that nodes in my primary datacenter are preferred over my disaster recovery datacenter?
  • How do I prevent a server in the disaster recovery datacenter from becoming the PAM?
  • Can I pause my nodes in the cluster to prevent them from becoming the PAM?
  • Should I remove possible owners on the cluster name resource to prevent it from coming online on a server in my disaster recovery datacenter?

 

All of these questions involve modifying properties of the cluster core resource group.  By default, Exchange establishes the desired settings on the cluster core resource group.  Modifying these settings is typically not necessary and can sometimes cause undesired results.  For example, a customer recently paused all of the members in a disaster recovery datacenter to prevent those servers from becoming the PAM.  This worked very well in preventing arbitration of the cluster core resource group to those nodes, until instability of the Cluster service in the primary datacenter resulted in no members being able to take ownership of the cluster core resources.  In that instance the PAM was lost, and coordination of database activities across the DAG failed.

 

The Cluster service is designed to allow the cluster group to arbitrate freely across nodes.  Attempts to modify the failover behavior, or to prevent failover between nodes, can yield undesired results and potential instability.  As a result, we recommend that you leave the default out-of-box settings intact unless you are directed by Microsoft Customer Services and Support to change them.
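If you want to review the settings Exchange establishes without changing anything, the failover clustering cmdlets can display them. A read-only sketch:

```powershell
# Inspect, without modifying, the current configuration of the
# cluster core resource group and its resources.
Import-Module FailoverClusters

$group = Get-ClusterGroup -Name "Cluster Group"
$group | Format-List Name, OwnerNode, State, AutoFailbackType

# List each resource in the group along with its possible owners.
$group | Get-ClusterResource | ForEach-Object {
    [pscustomobject]@{
        Resource       = $_.Name
        State          = $_.State
        PossibleOwners = ($_ | Get-ClusterOwnerNode).OwnerNodes -join ', '
    }
} | Format-Table -AutoSize
```

Comparing this output across DAG members over time is a safe way to confirm the defaults are intact before engaging support.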


Comments (15)

  1. TIMMCMIC says:

    @Sajid:

    I’m referring to pausing a clustered node within cluster.

    cluster.exe node <NODE> /pause

    TIMMCMIC

  2. TIMMCMIC says:

    @ME

    NP

    TIMMCMIC

  3. Anonymous says:

    @Kannan…

    I’m glad you found it helpful.

    TIMMCMIC

  4. sachinT says:

    Excellent Article

  5. sajid says:

    " For example, recently a customer paused all of the members in a disaster recovery datacenter to prevent those servers from becoming the PAM."
    Do you mean they paused the servers, or that they prevented databases from mounting during a failover by using the suspend or Set-MailboxServer commands? We don't want active databases to fail over to the DR site under normal conditions, so we run these commands:

    Suspend-MailboxDatabaseCopy -Identity DB1EXCH3 -ActivationOnly
    or
    Set-MailboxServer -Identity EXCH1 -DatabaseCopyAutoActivationPolicy Blocked

    Kindly clarify.
    Regards

  6. Kannan says:

    Wonderful article. It kept me from modifying the PAM configuration.
    Thanks

  7. Steven says:

    Thanks so much for the article. I’ve actually had this question for a long time. My Exchange 2010 environment is configured with static DNS records. I realize that’s not ideal, but it was a byproduct of some DNS record issues in Windows 2008 that have
    since been fixed. At any rate, it seems a variety of incidents may cause quorum ownership to flip to the DR site; this causes the cluster virtual name to be unreachable by DNS, and things like backups end up failing. We believe this is problematic during certain
    WAN conditions where the link may flap rapidly. We have precious little evidence to back this up, but we believe the DR side ends up locking the witness, which, when the link goes down again, makes the local nodes in the production site lose quorum, and of course
    production then goes down. Obviously I’m simplifying the scenario for brevity. Questions from all this. First, I don’t understand why ownership seems to invariably shift to the remote site rather than the second local node? Second, and I realize you advised
    not to mess around, but given the particulars is there any merit to modifying the possible owners list in the FSW’s properties? Third, if the theory of the outage holds water, which I leave to your expert judgement, is it a better choice to just add a third
    node in the production site and just forget about the FSW altogether? Is that considered a more robust solution in terms of resilience?

    I know that’s a lot, but please let me know what you think, as this issue has been dogging us for a while. Thanks.

    Steven

  8. TIMMCMIC says:

    @Steven…let’s see if we can break this down.

    First – it should be considered a REQUIREMENT to have dynamic DNS whenever a multi-subnet cluster solution is deployed. If this is not available – then it is your responsibility to determine or craft a way to determine the cluster group owner and update the
    subsequent DNS records. (Honestly – I’d be putting my effort into getting out of static records and into dynamic…it’s really time :-))

    Second – you are correct. When DNS records are broken multiple issues can arise – like backups failing and cluster tools being upset that they cannot contact the name.

    Third – Please do not try to be crafty and change anything. Backups failing and the like are the least of your issues when you mess with these resources and break something bigger (like Exchange!).

    Fourth – Your theory could hold water – but there would have to be some other factors. Essentially, when the nodes believe that communications are lost such that the file share witness is necessary to maintain quorum, each node begins an arbitration process for
    the witness. The node that owns the cluster core resources gets the first shot – all other nodes sleep 6 seconds. So in theory, if the group was owned by a node in the primary site, and the primary site had access to the witness, then this is where the resources
    should have stayed and quorum maintained. If the group moved during this network outage – then this would imply that the primary nodes could not access the witness, a node in DR could, and therefore the group was arbitrated to a surviving node. Really, though,
    this conversation is academic in that this is not your issue – your issue is why you are having the networking problems you are having and what you are doing to fix them.

    Lastly – in the absence of network issues, the arbitration of the cluster group is random node selection. Essentially something happens – maybe the witness resource fails or the name resource has an issue – and the cluster just pushes the group to another node.
    The next version of Windows is site aware – and will start arbitrating locally before remote…but that does not help you much right now.

    TIMMCMIC

  9. Steven says:

    All points taken.

    Thanks for responding so quickly. I’ve never really wanted to mess with the underlying cluster, but this is the first article I turned up to support that reluctance. I’ll continue to resist that, but at least I have a good counter argument when management asks
    about possible owners, thanks to you again.

    Static DNS stuck due to priorities. We make it a part of our procedures to monitor and switch back, not a big issue. We’ll be dynamic for 2013/16, Server 2012 as you mentioned may also help.

    I’m more concerned with the resulting outage and why the 2 production nodes with LAN connectivity to each other and the FSW would lose quorum. The ownership issue / theory only relates to what happens as the cluster tries to recover. MS support has laid out
    some possibilities, but we’re not entirely clear on them and lack the logging necessary to prove anything out. I have some reading to get through on the matter.

    With respect to the WAN issue itself we found the problem with our redundancy, so hopefully the catalyst will go away.

  10. TIMMCMIC says:

    @Steven:

    Some customers will actually schedule a script to run to move the group back. You might find something like this helpful in your interim while you’re working on dynamic DNS.
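    Such a move-back script might look roughly like the following sketch (the preferred node name MBX-1 is a placeholder; adjust and test for your environment):

```powershell
# Sketch: if the Cluster Group is not on the preferred node, move it back.
# Intended to run as a scheduled task on a DAG member.
# 'MBX-1' is a placeholder for the node in your primary datacenter.
Import-Module FailoverClusters

$preferred = 'MBX-1'
$group = Get-ClusterGroup -Name 'Cluster Group'

if ($group.OwnerNode.Name -ne $preferred) {
    Move-ClusterGroup -Name 'Cluster Group' -Node $preferred
}
```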

    Without actually having looked at the logs myself, I’m not sure I could elaborate on potential causes.

    TIMMCMIC

  11. MR. Singh says:

    Thank you for posting this article. really it is very helpful for me. thanx a lot keep it up.

  12. TIMMCMIC says:

    @MR. Singh…

    Thanks for the comment.

    TIMMCMIC

  13. Cloud-Ras says:

    Outstanding blog article 🙂

  14. TIMMCMIC says:

    @Cloud-Ras…

    I’m glad you enjoyed it.

    TIMMCMIC