Continuing from where we left off the last time…
Q1: The Replication Health pane has loads of information. How do I interpret these attributes?
I thought you would never ask.
1. On both the Primary and Replica VM, the following attributes are displayed:
- Replication State: The current state of the replicating VM. The possible values are listed in Q3 of the previous post, under the UI column.
- Replication Type: Indicates whether the VM is a Primary VM or a Replica VM.
- Current Primary Server: The FQDN of the server on which the primary VM resides.
- Current Replica Server: The FQDN of the destination server on which the replica VM resides.
- Replication Health: The health of the replicating VM. The possible values are Normal, Warning, or Critical.
- From Time: The start time of the monitoring interval.
- To Time: The current time, or the time at which this window was launched.
The statistics below are collected between ‘From Time’ and ‘To Time’:
- Average size: The average size of the replica file (sent in every replication interval).
- Maximum size: The maximum size of the replica file.
- Average Latency: The average time taken to transfer the replica file in a replication interval.
- Errors encountered: The number of errors encountered (e.g., network disconnects, resynchronization).
- Successful replication cycles: Hyper-V Replica attempts to send the replica file every 5 minutes, so there are 12 replication cycles in an hour. This counter captures the number of successful replication attempts.
- Last synchronized at: The last time the replica file was sent from the primary server, or received and applied on the replica server. The difference between the current time and this value indicates the loss of data (measured in time) if a failover is initiated.
2. On the primary VM:
- Size of data yet to be replicated: This refers to the size of the replica file which is being tracked but not sent to the replica server yet. The value signifies the loss of data (measured in MBs) if a failover is initiated on the replica VM.
3. On the Replica VM:
- Test failover status: If Test failover is enabled at the time of measuring the statistics, then this attribute is set to ‘Running’ – else, it is set to ‘Not Running’
- Last Test Failover initiated at: This refers to the wall-clock time when the last test-failover operation was initiated.
- Refresh: Refreshes the statistics by updating the ‘To time’.
- Reset Statistics: Zeros out the statistics for the current interval and starts afresh. You would typically use this option after rectifying a problem.
- Save As: Saves the monitoring information as a CSV file which can be archived
Q2: Some follow-up questions – what is the replication interval? How can I change it?
Hyper-V Replica tracks the writes to the VM in a log file. This log file is sent every 5 minutes; that period is the replication interval. Administrators cannot configure this interval.
Due to network or storage issues, or excessive churn in the VM, the replica file might take more than 5 minutes to reach the destination server and be applied to the replica VM. Hyper-V Replica has inbuilt semantics to handle such situations by delaying the transfer of the next replica file. This impacts the ‘Successful Replication Cycles’ and ‘Average Latency’ statistics.
Q3: What is the Monitoring Interval and Monitoring Start Time and how do I get/set this?
The monitoring interval is a server-level attribute: the period over which the replication statistics are captured and computed. It can be viewed with Get-VMReplicationServer, which reports both MonitoringInterval and MonitoringStartTime.
MonitoringInterval defaults to 12 hours. The minimum value that can be set is 1 hour and the maximum is 7 days. A reasonably high value is recommended, as smaller intervals might lead to incorrect conclusions.
The MonitoringStartTime refers to the time at which Hyper-V Replica should start monitoring the replicating VM. The input is denoted in a 24hr clock and is set to 9AM local time by default.
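To check the current settings, something along these lines should work. It is a minimal sketch, assuming the Hyper-V PowerShell module is installed and the session is elevated:

```powershell
# Show the monitoring settings of the local Hyper-V Replica server.
# Add -ComputerName <server> to query a remote host instead.
Get-VMReplicationServer |
    Format-List MonitoringInterval, MonitoringStartTime
```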
Both these values can be changed using the Set-VMReplicationServer cmdlet, through its MonitoringInterval and MonitoringStartTime parameters. For example, consider setting the monitoring interval to 12 hours and the start time to 6 AM.
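The change described above could be issued roughly as follows. This is a sketch rather than a copy of any original listing; it assumes an elevated session on the server whose settings you want to change:

```powershell
# Monitor over 12-hour windows, anchored at 6 AM local time.
# Both parameters take a TimeSpan; the strings below are converted automatically.
Set-VMReplicationServer -MonitoringInterval "12:00:00" -MonitoringStartTime "06:00:00"
```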
In this example, when a VM is enabled for replication at 2 PM, statistics are collected from 2 PM to 6 PM on the same day and health is reflected for this interval. The statistics are then reset and collected from 6 PM to 6 AM the next day, and health is reflected for that interval.
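The arithmetic behind that example can be sketched in a few lines. The function name Get-NextStatisticsReset is hypothetical, and the logic is only a model of the behaviour described above (window boundaries fall at the start time plus whole multiples of the interval). It is not a description of how Hyper-V Replica computes this internally.

```powershell
# Hypothetical helper: models when the current statistics window ends.
function Get-NextStatisticsReset {
    param(
        [datetime]$Now,                  # the moment of interest
        [timespan]$MonitoringStartTime,  # e.g. 06:00:00
        [timespan]$MonitoringInterval    # e.g. 12:00:00
    )
    # Most recent daily anchor (start time) at or before $Now.
    $anchor = $Now.Date + $MonitoringStartTime
    if ($anchor -gt $Now) { $anchor = $anchor.AddDays(-1) }

    # Whole intervals that have elapsed since that anchor.
    $elapsed = [long][math]::Floor(($Now - $anchor).Ticks / $MonitoringInterval.Ticks)

    # The window containing $Now ends one interval after its own start.
    $anchor.AddTicks(($elapsed + 1) * $MonitoringInterval.Ticks)
}

# A VM enabled for replication at 2 PM, with a 6 AM start time and a 12-hour interval:
Get-NextStatisticsReset -Now ([datetime]'2024-01-10T14:00:00') `
    -MonitoringStartTime ([timespan]'06:00:00') `
    -MonitoringInterval  ([timespan]'12:00:00')
```

With these inputs the function returns 6 PM on the same day, matching the first (shortened) window in the example.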
Q4: Are the statistics from the previous monitoring intervals available?
Yes. In the Event Viewer, under the Hyper-V VMMS node, an Information message is recorded for each interval. The event ID for this is 29174.
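The same events can be pulled from PowerShell. The sketch below assumes the events land in the VMMS Admin channel, whose log name is Microsoft-Windows-Hyper-V-VMMS-Admin; adjust the log name if your host records them elsewhere.

```powershell
# List the ten most recent statistics events (ID 29174) on the local host.
# Assumption: the events are written to the VMMS Admin channel.
# Get-WinEvent reports an error if no matching event exists yet.
Get-WinEvent -FilterHashtable @{
    LogName = 'Microsoft-Windows-Hyper-V-VMMS-Admin'
    Id      = 29174
} -MaxEvents 10 |
    Select-Object TimeCreated, Message |
    Format-List
```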
Q5: Is the health attribute preserved when the VM migrates?
Yes, when the VM migrates from one node to another, the replication statistics are preserved and used in the new node.
Q6: I manage an N-node cluster with many replicating VMs, and I (obviously) cannot click on each VM to know its health. Is there an easier way to manage this from the UI?
Yes! From the Failover Cluster Manager you can run a query to get VMs with a specific Replication Health. Under Roles, click on ‘Add Criteria’, choose ‘Replication Health’ and specify the criteria (Critical/Normal/Warning).
Q7: Is there any such provision in the Hyper-V Manager?
Yes. You can add the ‘Replication Health’ column from the Add/Remove Column option in Hyper-V Manager.
Q8: Is there a PowerShell cmdlet to get all this information?
Yup, Measure-VMReplication captures the health and state-related information.
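A few ways to call it, as a minimal sketch; "MyVM" is a placeholder name, not a VM from this post:

```powershell
# Default table view: one row per replicating VM on this host.
Measure-VMReplication

# Every reported attribute for a single VM ("MyVM" is a placeholder).
Measure-VMReplication -VMName "MyVM" | Format-List *

# Only the VMs whose replication health is Critical.
Measure-VMReplication -ReplicationHealth Critical
```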
Q9: Is this sufficient to build an alerting mechanism?
Yes. Using the cmdlet, you can set up custom warnings, send mail, run it frequently from Task Scheduler, and so on. The options are limitless.
Our resident PS expert, Rahul Razdan, has this nifty PS script that sends out a mail detailing the health of the replicating VM.
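For a flavour of what such a script can look like, here is a minimal sketch. It is not Rahul's script, and the SMTP server and mail addresses are hypothetical placeholders. It mails a report only when at least one VM on the local host is in Warning or Critical health.

```powershell
# Hypothetical values: replace with your own SMTP relay and addresses.
$smtpServer = "smtp.contoso.example"
$from       = "hyperv-alerts@contoso.example"
$to         = "admin@contoso.example"

# Collect the replicating VMs on this host whose health is not Normal.
$unhealthy = @(Measure-VMReplication -ReplicationHealth Warning) +
             @(Measure-VMReplication -ReplicationHealth Critical)

if ($unhealthy.Count -gt 0) {
    # Render every reported attribute as plain text for the mail body.
    $body = $unhealthy | Format-List * | Out-String
    Send-MailMessage -SmtpServer $smtpServer -From $from -To $to `
        -Subject "Hyper-V Replica: $($unhealthy.Count) VM(s) need attention on $env:COMPUTERNAME" `
        -Body $body
}
```

Scheduled through Task Scheduler, a script like this stays silent while everything is healthy.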
The script which works for a standalone node can be easily extended to query across a cluster (using Get-ClusterNode). Give it a shot in your deployment!
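One possible shape for the cluster-wide query is shown below. It assumes the FailoverClusters module is available and that the account running it has Hyper-V rights on every node.

```powershell
# Run on a cluster node; requires the FailoverClusters module.
Get-ClusterNode |
    Where-Object { $_.State -eq 'Up' } |   # skip nodes that are down or paused
    ForEach-Object {
        # Ask each node for the replicating VMs it currently hosts.
        Measure-VMReplication -ComputerName $_.Name
    }
```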
In summary, it is extremely important to monitor the health of the replicating VMs. The system has inbuilt retry semantics to address transient issues (eg: network outage) but there are certain events which require your intervention (eg: disk issues). Analyzing the replication health from time to time will help you identify and fix these issues.