RELEASED: Exchange 2010 Database Redundancy Check Script


Ensuring that your servers are operating reliably and that your mailbox database copies are healthy are primary objectives of daily Exchange 2010 messaging operations. Of course you must actively monitor the hardware, the Windows operating system, and the Exchange 2010 services. But when running in an Exchange 2010 mailbox resiliency environment, it is important that you monitor the health and status of the database availability group (DAG) and your mailbox database copies. It is especially vital to perform data redundancy risk management and monitor for periods in which a replicated database is down to just a single copy. This is particularly critical in environments that do not use RAID and instead deploy Just a Bunch Of Disks (JBOD). In a RAID environment, a single disk failure does not affect an active mailbox database copy. However, in a JBOD environment, a single disk failure will trigger a database failover. It is therefore a top priority for administrators to know when they are down to a single healthy copy of a database.

Note: It’s important to understand how we count copies. When you create a new database, but before you run Add-MailboxDatabaseCopy, you have one copy of the database. When you run Add-MailboxDatabaseCopy for the first time, you are creating your second database copy.

Exchange 2010 includes several built-in tools and features that should be used as part of regular proactive monitoring of a highly available Exchange environment, such as the Get-MailboxDatabaseCopyStatus and Test-ReplicationHealth cmdlets, and the CollectOverMetrics.ps1 and CollectReplicationMetrics.ps1 scripts.

Today, we are releasing an additional PowerShell script called CheckDatabaseRedundancy.ps1. As its name implies, the purpose of the script is to monitor the redundancy of replicated mailbox databases by validating that there are at least two configured, healthy, and current copies, and to alert you when only a single healthy copy of a replicated database exists. Both active and passive copies are counted when determining redundancy.

When executing the script, you must specify either a database name or a DAG member name. To specify a database, you use the MailboxDatabaseName parameter and to specify a DAG member, you use the MailboxServerName parameter. When run interactively in the console, the script performs the redundancy check only once, and outputs the CurrentState (red or green) on the screen:

[PS] CheckDatabaseRedundancy.ps1 -MailboxDatabaseName "Mailbox Database 1928496050"

DatabaseName : Mailbox Database 1928496050
LastRedundancyCount : 0
CurrentRedundancyCount : 2
LastState : Unknown
CurrentState : Green
LastStateTransitionUtc : 5/11/2010 7:51:19 PM
LastGreenTransitionUtc : 5/11/2010 7:51:19 PM
LastRedTransitionUtc :
LastGreenReportedUtc : 5/11/2010 7:51:19 PM
LastRedReportedUtc :
PreviousTotalRedDuration : 00:00:00
TotalRedDuration : 00:00:00
IsTransitioningState : True
HasErrorsInHistory : False
CurrentErrorMessages :
ErrorHistory :

Like other scripts and cmdlets, CheckDatabaseRedundancy.ps1 can also be run in monitoring mode, in which it generates events, by adding the MonitoringContext parameter. This enables the script to be invoked by a monitoring solution, such as Microsoft System Center Operations Manager (SCOM). In monitoring mode, the script logs red alert and green alert events in the local server’s Application event log. A red alert event (event ID 4113) is logged only if the database has been “red” for 20 minutes or more (cumulatively, not necessarily consecutively) during the hour-long run of the script, and a green alert event (event ID 4114) is logged when the database has been “green” for 10 consecutive minutes. By default, once a red alert event is generated, it continues to be reported every 15 minutes.
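The timing rules above can be illustrated with a minimal sketch. It is written in Python purely for illustration and is not taken from CheckDatabaseRedundancy.ps1; the class, function, and constant names are hypothetical, and the thresholds simply restate the 20-minute, 10-minute, and 15-minute figures from the text. How the real script resets its counters may differ.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

RED_EVENT_ID = 4113
GREEN_EVENT_ID = 4114

# Thresholds restated from the text (hypothetical constant names).
RED_AFTER_SECS = 20 * 60     # cumulative red time before the first red event
RED_REPEAT_SECS = 15 * 60    # interval between repeated red events
GREEN_AFTER_SECS = 10 * 60   # consecutive green time before a green event


@dataclass
class AlertTracker:
    total_red_secs: int = 0
    green_streak_secs: int = 0
    last_red_event_at: Optional[int] = None
    green_reported: bool = False
    events: List[Tuple[int, int]] = field(default_factory=list)

    def observe(self, now: int, state: str, elapsed: int) -> None:
        """Record that the database was in `state` for the last `elapsed` seconds."""
        if state == "Red":
            self.total_red_secs += elapsed      # red time is cumulative
            self.green_streak_secs = 0          # green time must be consecutive
            self.green_reported = False
            due = (self.last_red_event_at is None
                   or now - self.last_red_event_at >= RED_REPEAT_SECS)
            if self.total_red_secs >= RED_AFTER_SECS and due:
                self.events.append((now, RED_EVENT_ID))
                self.last_red_event_at = now
        else:
            self.green_streak_secs += elapsed
            if self.green_streak_secs >= GREEN_AFTER_SECS and not self.green_reported:
                self.events.append((now, GREEN_EVENT_ID))
                self.green_reported = True


def simulate(states, step_secs=60):
    """Feed one state per time step and return the (time, event ID) pairs logged."""
    tracker = AlertTracker()
    for i, state in enumerate(states, start=1):
        tracker.observe(i * step_secs, state, step_secs)
    return tracker.events


if __name__ == "__main__":
    # 25 minutes red, then 35 minutes green, sampled once per minute.
    print(simulate(["Red"] * 25 + ["Green"] * 35))
```

Because red time accumulates across the run, a database that flaps between red and green can still cross the red threshold even though it is never red for 20 minutes in a row.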

Below is an example of a red alert event:

[Figure: Single Copy Alert - Red Alert Event]

Below is an example of a green alert event:

[Figure: Single Copy Alert - Green Alert Event]

Note: These events will not appear as shown above until the event resource binary file containing the updated strings for this event is installed on the system. This binary file (clusmsg.dll) will be updated with the first update rollup that includes the CheckDatabaseRedundancy.ps1 strings (most likely Update Rollup 4). Until then, the description of the event will read as follows: “The description for Event ID 4114 from source MSExchangeRepl cannot be found. Either the component that raises this event is not installed on your local computer or the installation is corrupted. You can install or repair the component on the local computer.” The lack of these strings in the event will not affect monitoring, as event 4113 always indicates a red alert (and it will contain the name of the database and the errors that caused that database to be down to a single copy), and event 4114 always indicates a green alert.

In addition, the script has some other useful options. For example, you can add the ShowDetailedErrors parameter to get greater detail about any errors that occur, and you can add the Verbose parameter for additional troubleshooting information. The script also includes a SendSummaryMailTos parameter, which can be used to send a summary report by email to a list of specified email addresses when the script has finished running. This enables administrators to quickly look at hourly reports to see if any redundancy issues have occurred. Whenever you use the SendSummaryMailTos parameter, you must also include the SummaryMailFrom parameter.

We recommend running this script regularly, as part of your normal monitoring operations. To ensure you don’t have lengthy periods in which database redundancy is compromised, run the script every 60 minutes. The script includes a parameter called TerminateAfterDurationSecs; setting it to -1 or 0 makes the script run indefinitely. If you’re not running a monitoring solution such as SCOM, you can create a Windows scheduled task to automate and schedule script execution. However, be aware that there are known issues in the Windows Server 2008 SP2 Task Scheduler that may cause Task Scheduler to crash when you have scheduled a long-running task. These issues do not exist in Windows Server 2008 R2, so if possible, run the script from Windows Server 2008 R2.

If you can’t run the script from Windows Server 2008 R2, and you’re running it from Windows Server 2008 SP2, we recommend two modifications. First, instead of running the script with its built-in transient suppression of 60 minutes, run the script every 5 minutes by using the following parameters:

CheckDatabaseRedundancy.ps1 -MonitoringContext -SleepDurationBetweenIterationsSecs:0 -TerminateAfterDurationSecs:1 -SuppressGreenEventForSecs:0 -ReportRedEventAfterDurationSecs:0 -ReportRedEventIntervalSecs:0 -ShowDetailedErrors

Second, if possible, use SCOM to define the transient suppression behavior (e.g., if 3 red alert events are logged within a 20 minute period, generate an alert; and if a green alert event is logged, change the CurrentState to Green).
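To make the suggested suppression rule concrete, here is a minimal sliding-window sketch in Python. It is not SCOM configuration and not part of the script; the class and method names are hypothetical, and it only encodes the example rule from the text (three red alert events within 20 minutes raise an alert, and any green alert event sets the state back to Green).

```python
from collections import deque

RED_EVENT_ID = 4113
GREEN_EVENT_ID = 4114


class TransientSuppressor:
    """Hypothetical helper that applies the '3 reds in 20 minutes' rule."""

    def __init__(self, threshold=3, window_secs=20 * 60):
        self.threshold = threshold
        self.window_secs = window_secs
        self.recent_reds = deque()   # timestamps of red events inside the window
        self.state = "Unknown"

    def handle(self, timestamp, event_id):
        """Feed one logged event; return True if it should raise an alert."""
        if event_id == GREEN_EVENT_ID:
            # A green event clears the red history and resets the state.
            self.recent_reds.clear()
            self.state = "Green"
            return False
        if event_id != RED_EVENT_ID:
            return False             # ignore unrelated events
        self.recent_reds.append(timestamp)
        # Drop red events that have fallen out of the window.
        while timestamp - self.recent_reds[0] > self.window_secs:
            self.recent_reds.popleft()
        if len(self.recent_reds) >= self.threshold:
            self.state = "Red"
            return True
        return False


if __name__ == "__main__":
    suppressor = TransientSuppressor()
    # Red events logged every 5 minutes (timestamps in seconds), then a green event.
    for ts, event_id in [(0, 4113), (300, 4113), (600, 4113), (900, 4114)]:
        print(ts, event_id, suppressor.handle(ts, event_id), suppressor.state)
```

In this sketch every further red event inside a full window also returns True; a real monitoring rule would normally raise the alert once and keep it open until the green event arrives.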

Here are the steps you can use to schedule this script:

  1. Copy the script to the Exchange server or management workstation from which you want to run it. Do not copy this into the \Scripts folder. Instead, choose a unique location for the script (for example, C:\Operations).
  2. Configure a scheduled task through the Windows Task Scheduler by running the following command:

schtasks /create /TN "Check Database Redundancy" /TR "Powershell.exe -NonInteractive -WindowStyle Hidden -command 'C:\Program Files\Microsoft\Exchange Server\V14\bin\RemoteExchange.ps1'; Connect-ExchangeServer -auto; C:\Operations\CheckDatabaseRedundancy.ps1 -MonitoringContext -ShowDetailedErrors -SummaryMailFrom:'SMTPFromAddress@contoso.com' -SendSummaryMailTos:@('SMTPToAddress@contoso.com') -ErrorAction:Continue" /RU SYSTEM /SC HOURLY

Replace the parameters in the above command with the script parameters you want to use. Additional parameters for the script are also described in the script.

When using the schtasks command line tool to create a scheduled task, the /TR option is limited to 261 characters, which is easy to exceed when using multiple script parameters. The above example exceeds that limit. If the parameters and paths you use cause the /TR option to exceed 261 characters then you must manually create the scheduled task using the Task Scheduler applet on the Administrative Tools menu. Alternatively, you can download this XML file, edit it appropriately, save it, and import it using the Task Scheduler applet.
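Before running schtasks, it can help to check whether the string you plan to pass to /TR fits within the limit. The following is a minimal Python sketch of such a check; the function name is hypothetical, and it assumes the 261-character limit applies to the text between the outer quotation marks.

```python
TR_LIMIT = 261  # limit on the /TR option, as stated in the text


def tr_fits(task_run: str, limit: int = TR_LIMIT) -> bool:
    """Return True if the /TR value is within the limit (assumed inclusive)."""
    return len(task_run) <= limit


# The /TR value from the schtasks example above.
example_tr = (
    "Powershell.exe -NonInteractive -WindowStyle Hidden -command "
    r"'C:\Program Files\Microsoft\Exchange Server\V14\bin\RemoteExchange.ps1'; "
    "Connect-ExchangeServer -auto; "
    r"C:\Operations\CheckDatabaseRedundancy.ps1 -MonitoringContext -ShowDetailedErrors "
    "-SummaryMailFrom:'SMTPFromAddress@contoso.com' "
    "-SendSummaryMailTos:@('SMTPToAddress@contoso.com') -ErrorAction:Continue"
)

if __name__ == "__main__":
    print(len(example_tr), "characters; fits:", tr_fits(example_tr))
```

If the check fails, create the task in the Task Scheduler applet or import the XML file, as described above.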

We’re releasing this script to you now because we think it is very important that all customers monitor for situations in which database redundancy is compromised and immediately take action to restore database redundancy and avoid catastrophic data loss. Eventually, a version of the script will be released in a forthcoming update rollup for Exchange 2010 (most likely Update Rollup 4), and after that it is expected to ship in Service Pack 1. Note that when it does ship with SP1, the Release Notes may include updated information for scheduling the script to run regularly on your servers.

We hope you find this useful, and welcome your feedback. You can download the script here.

Scott Schnoll


Comments (14)
  1. Constantino Tobio says:

    I’ve set this up as a scheduled task on one of my mailbox servers, but the event log gives me this:

    Log Name:      Application
    Source:        MSExchangeRepl
    Date:          5/21/2010 2:39:57 PM
    Event ID:      4114
    Task Category: Service
    Level:         Information
    Keywords:      Classic
    User:          N/A
    Computer:
    Description:
    The description for Event ID 4114 from source MSExchangeRepl cannot be found. Either the component that raises this event is not installed on your local computer or the installation is corrupted. You can install or repair the component on the local computer.

    If the event originated on another computer, the display information had to be saved with the event.

    The following information was included with the event:

    DB11
    2

    the message resource is present but the message is not found in the string/message table

  2. Constantino Tobio says:

    One other issue with the script: it declares a red alert for a perfectly functioning DB in a DAG that intentionally does not have a replica (like a test DB or a restore DB). Short of running this once for every DB I have, is there a way to tell the script to ignore these types of DBs?

  3. Constantino Tobio says:

    Disregard the first comment- I failed to RTFM. :)

  4. Exchange says:

    Constantino, you did not fail to RTFM.  I added that note about the missing strings after your first comment.  It should have been in the blog post in the first place. Sorry about any confusion.

    As for your second question, yes there is!  The script has a SkipDatabasesRegex parameter which is used to skip the default databases, but can also be used to append any other databases you want to exclude from monitoring.  For example, if you want to ignore the default databases and two test databases (mdb2 and mdb3), you would run

    .\CheckDatabaseRedundancy.ps1 -SkipDatabasesRegex:"^(Mailbox Database \d{10})|(mdb2)|(mdb3)$"

    Hope this helps.

  5. Constantino Tobio says:

    Excellent, I’ve implemented the regex and it is indeed working. I’ve put this script into service and it’s doing its thing. Glad I also did not fail to RTFM.

    I may also build some sort of alerting on DB Mount from perfmon- I’d like to actually know when a DB switches the server it is active on and get alerted on that.

    Perfmon/DB Active Manager/Database Mounted is perfect for this.

  6. Andy Forex says:

    I’ve implemented the regex and it’s working. Thanks for posting this!

  7. Rico says:

    Is there a way to skip a lagged copy?

    Otherwise the script always results in this error:

    Passive copy ‘LAG copyserver’ has a replay queue higher than the warning threshold of ‘500’

  8. GiveMeABreak says:

    And hilariously this is not included in the Exchange 2010 Management Pack for SCOM.  Why do you guys always think of the obvious so late in the game?  It’s not like we are talking about protecting data that is NOT being backed up via traditional means anymore….oh wait.

    Here’s a tip:  Try integrating critical products from the same company with releases of new versions of Exchange.  Telling customers "Wait until Rollup X" doesn’t cut it when you release with half-baked monitoring solutions and missing functionality.  Yes, I’m ranting; however, I believe it’s justified, as the history of this behavior goes back many years now.

    Even funnier is this is exposing one of the million reasons not to go down the JBOD path unless you are an extremely small/cash strapped shop.

  9. GiveYourselfABreak says:

    If you don’t like the product, you can always take your company’s money elsewhere, right?  No one forced you to use Exchange 2010 (if you even do), so why such sour grapes?  If it doesn’t fit your requirements, you should not have wasted corporate funds on such a product…so who is at fault?

  10. Karsten says:

    I tried the script, and it says we have a problem. The CurrentRedundancyCount should be 2, but it is 0. The error messages (see below) even state that there are 2 copies.

    How can I find out what’s wrong?

    DatabaseName             : A-MBDB01
    LastRedundancyCount      : 0
    CurrentRedundancyCount   : 0
    LastState                : Unknown
    CurrentState             : Red
    LastStateTransitionUtc   : 27.05.2010 11:13:11
    LastGreenTransitionUtc   :
    LastRedTransitionUtc     : 27.05.2010 11:13:11
    LastGreenReportedUtc     :
    LastRedReportedUtc       : 27.05.2010 11:13:11
    PreviousTotalRedDuration : 00:00:00
    TotalRedDuration         : 00:00:00
    IsTransitioningState     : True
    HasErrorsInHistory       : True
    CurrentErrorMessages     : {Passive copy 'A-MBDB01\CNEXD01B' is not UP according to clustering., Active copy 'A-MBDB01\CNEXD01A' is not UP according to clustering.,
                               Name               Status   RealCopyQueue  InspectorQueue  ReplayQueue  CIState
                               ----               ------   -------------  --------------  -----------  -------
                               A-MBDB01\CNEXD01B  Healthy  0              0               0            Healthy
                               A-MBDB01\CNEXD01A  Mounted  0              0               0            Healthy}
    ErrorHistory             : {CheckHADatabaseRedundancy.DatabaseRedundancyEntry+ErrorRecord}

  11. Exchange says:

    Rico, you can use the RegEx option to skip lagged copies, too.

    Karsten, based on your output, it looks like one of your DAG members does not have the Cluster service running.  Please verify that the cluster service is running on all DAG members.

  12. Karsten says:

    Thanks for answering. I checked the Cluster service, and it was OK. But we found the error: as you write in the script itself, one part is locale dependent.

    Instead of:

    $script:clusterNodeStateTable[$serverName] -ieq "Up")

    in German (we run both the OS and Exchange in German) it has to be:

    $script:clusterNodeStateTable[$serverName] -ieq "Aktiv")

    Now it works.

  13. Exchange says:

    Karsten, thanks for the update.  I’ll pass it along to the developers.

  14. Rhoderick Milne says:

    You could specify the locale to use for the output.  Read this article, which has the same issue.

    http://blogs.msdn.com/b/virtual_pc_guy/archive/2010/05/18/handling-international-wmi-clients-servers-with-hyper-v.aspx

Comments are closed.
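A note on the SkipDatabasesRegex pattern suggested in the comments: in the pattern ^(Mailbox Database \d{10})|(mdb2)|(mdb3)$ the ^ anchor applies only to the first alternative and the $ anchor only to the last, so the pattern can match more names than intended. The sketch below illustrates this using Python's re module as a stand-in for the .NET regex engine that PowerShell uses. It assumes the script tests each database name with a search-style match, which is not confirmed by the post; the STRICT pattern and the helper name are hypothetical.

```python
import re

# Pattern from the comment thread: only the first alternative is anchored
# at the start, and only the last alternative is anchored at the end.
PATTERN = r"^(Mailbox Database \d{10})|(mdb2)|(mdb3)$"

# Hypothetical stricter variant: one group, anchored at both ends.
STRICT = r"^(Mailbox Database \d{10}|mdb2|mdb3)$"


def skipped(name: str, pattern: str = PATTERN) -> bool:
    """Return True if `name` would be skipped under a search-style match."""
    return re.search(pattern, name) is not None


if __name__ == "__main__":
    for name in ["Mailbox Database 1928496050", "mdb2", "mdb3", "mdb20", "DB11"]:
        print(f"{name!r}: original={skipped(name)}, strict={skipped(name, STRICT)}")
```

With the original pattern, a database named mdb20 would also be skipped because the (mdb2) alternative is unanchored. If that is not what you want, group the alternatives inside a single pair of anchors as in the stricter variant.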