What’s the difference between High-availability and Live Migration?

As I talk with customers and partners, too many people think that Live Migration is High-availability.  It’s not.  Live Migration can help create a highly available workload, but Live Migration does nothing if the server delivering your workload fails.  I want to take some time to talk about High-availability and Live Migration and what they can and can’t do.  First, let’s define High-availability and Live Migration and where they are used.  Then we will talk about where they can and can’t meet your needs and expectations.  I will wrap up by answering a few of the common questions that typically come up, including “how long will it take?”

High-Availability

I consulted Wikipedia for the definition of a High-Availability cluster and it does a good job of defining the way Windows delivers High-availability.  The article is a good read, but for the sake of this discussion, the first few sentences give us what we’re looking for:

High-availability clusters (also known as HA Clusters or Failover Clusters) are computer clusters that are implemented primarily for the purpose of providing high availability of services which the cluster provides. They operate by having redundant computers or nodes which are then used to provide service when system components fail. Normally, if a server with a particular application crashes, the application will be unavailable until someone fixes the crashed server. HA clustering remedies this situation by detecting hardware/software faults, and immediately restarting the application on another system without requiring administrative intervention, a process known as Failover.  As part of this process, clustering software may configure the node before starting the application on it.

I like this definition because it is straightforward.  I’ve talked before about setting up an HA cluster, so I won’t repeat that here, but take note that an HA cluster means two or more nodes that are able to work together to detect a failure and then automatically restart a workload.

Live Migration

I consulted Wikipedia again for a definition of Live Migration, but their definition still needs work so I’m pointing to Microsoft’s definition of Live Migration here:

Hyper-V and failover clustering can be used together to make a virtual machine that is highly available, thereby minimizing disruptions and interruptions to clients. Live migration requires the failover clustering feature to be added and configured on the servers running Hyper-V and allows you to transparently move running virtual machines from one node of the failover cluster to another node in the same cluster without a dropped network connection or perceived downtime.

The first thing I want to point out is that Live Migration uses failover clustering.  This is the same failover clustering that High-availability uses.  The good news is that once you’ve configured your servers for High-availability, you are set up to support Live Migration as well.  Here’s where it gets confusing.  Microsoft’s definition says: “Hyper-V and failover clustering can be used together to make a virtual machine that is highly available”.  From our definitions, High-availability (HA) and highly available mean two different things!  Let me explain…

High-availability (HA) means that if one server node fails, the workload gets restarted on a different node, the redundant node.  A highly available workload (Virtual Machine) as defined by Live Migration means that we can move the workload before the server it’s residing on goes down.  Live Migration requires that both the current and destination hosts be running for the whole Live Migration process.  A High-availability action, as defined above, only takes place when the current host for a workload fails.
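To make the distinction concrete, here is a minimal sketch of the two behaviors in Python.  This is a toy model, not a real clustering API: the `ToyCluster` class, its fields, and the rule for picking a surviving node are all assumptions I made up for illustration.  The point is only that a Live Migration needs both hosts up and causes no restart, while a failover happens after a host is already down and always costs a restart.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class ToyCluster:
    """Hypothetical model of a failover cluster (illustration only)."""
    up_nodes: Set[str]                 # nodes that are currently running
    placement: Dict[str, str]          # VM name -> node that hosts it
    restarts: Dict[str, int] = field(default_factory=dict)  # VM name -> failover reboots

    def live_migrate(self, vm: str, destination: str) -> None:
        """Planned move: both hosts must be running; the VM is never restarted."""
        source = self.placement[vm]
        if source not in self.up_nodes or destination not in self.up_nodes:
            raise RuntimeError("Live Migration needs both hosts running")
        self.placement[vm] = destination

    def fail_node(self, node: str) -> None:
        """Unplanned failure: VMs on the node go down and are restarted elsewhere."""
        self.up_nodes.discard(node)
        if not self.up_nodes:
            return  # no redundant node left, so nothing can restart the VMs
        # Assumption: pick the alphabetically first survivor; real clusters
        # choose an owner according to their own rules.
        survivor = sorted(self.up_nodes)[0]
        for vm, host in list(self.placement.items()):
            if host == node:
                self.placement[vm] = survivor
                self.restarts[vm] = self.restarts.get(vm, 0) + 1


cluster = ToyCluster(up_nodes={"A", "B"}, placement={"vm1": "A", "vm2": "A"})
cluster.live_migrate("vm1", "B")   # proactive move, vm1 keeps running
cluster.fail_node("A")             # vm2 suffers a dirty shutdown and is restarted on B
```

In this sketch, `vm1` ends up on node B with no restart recorded, while `vm2` ends up on node B with one restart recorded.  That restart is the outage that High-availability recovers from but cannot prevent.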

Cans and Can’ts

High-availability can restart a workload if the host server fails. 

You will have an outage if the host server fails.  The redundant node will take ownership of the failed workload and restart it.  The virtual machine will have to go through the complete boot process.  How long will this take?  Check the bottom of the post; I’ve walked through the recovery process there.

High-availability can’t prevent an outage of your workload.

High-availability says that if the host node suffers an outage, the redundant node will restart the workload.  Remember, the host node has failed, our workloads suffered a dirty shutdown and must now be restarted.

Live Migration can move a workload from one host to another.

Live Migration will move the running workload without a perceived loss of connectivity as long as both nodes remain up and running during the Live Migration process.  This is where we can create a highly available workload because we can proactively move a workload to a different host before we take the first host offline for maintenance.  If either node fails during the Live Migration process, the Live Migration is aborted.

Live Migration by itself can’t give you High-availability for unplanned outages.

Live Migration can help you proactively move to another node, but if the host node fails, Live Migration will do you no good.  For Live Migration to give you true High-availability, you must be able to predict the failure of your host node <grin>.  Sometimes you can.  System Center Operations Manager (SCOM) is able to detect failures in a server, so let’s say a server has redundant power supplies.  When one power supply fails, SCOM can detect the failure and initiate an evacuation of that host node, since the node is now at risk of going down if the second power supply fails.

Live Migration can make your workload highly available for planned outages.

Again, if you are planning maintenance for a host node, Live Migration can allow you to move your workloads to another node so you can take your original host node offline for maintenance. 

Common Questions and Answers

  • What if a host fails during a Live Migration?

If the destination node fails, the Live Migration fails and the workload remains running on the original host node.  The workload will remain functional and there will not be a loss in availability of the workload.

If the original host node fails before the completion of the Live Migration, the Live Migration will fail and the workload will go down with the original host node.  You will suffer an outage of the workload.  During this event, another node in the cluster (it may or may not be your desired destination node) will take ownership of the workload and reboot the workload (VM).  This is a normal reboot process and will take a few minutes to bring the workload back online.
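The two cases above boil down to a small decision table.  Here is a sketch of it in Python; the `Failed` enum and the `migration_outcome` function are names I invented for this post, not part of any Hyper-V or clustering API.

```python
from enum import Enum
from typing import Dict, Optional


class Failed(Enum):
    """Which host (if any) failed while the Live Migration was in flight."""
    SOURCE = "source"
    DESTINATION = "destination"


def migration_outcome(failed: Optional[Failed]) -> Dict[str, object]:
    """Return what happens to the workload, following the two cases described above."""
    if failed is None:
        # Both hosts stayed up: the move completes with no perceived downtime.
        return {"migration": "completed", "outage": False, "vm_rebooted": False}
    if failed is Failed.DESTINATION:
        # The workload never left the original host, so it keeps running there.
        return {"migration": "failed", "outage": False, "vm_rebooted": False}
    # The original host went down and took the workload with it; another
    # node takes ownership and reboots the VM.
    return {"migration": "failed", "outage": True, "vm_rebooted": True}


for case in (None, Failed.DESTINATION, Failed.SOURCE):
    print(case, migration_outcome(case))
```

The only case that costs you availability is losing the original host, which is exactly the case High-availability exists to recover from.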

  • If I lose a host node, how long does it take for my workloads to become available again?

When a host server fails, say from a power loss, the server stops functioning and all of its workloads immediately stop functioning.  At this point, we have suffered a dirty shutdown for the host node and all of the virtual machines running on that host node.  The redundant node(s) take(s) ownership of your workload(s) and restart(s) them.  How long the restart takes depends on the recovery each workload has to perform, which I cover below.

  • What is a Dirty Shutdown and what does that mean?

A Dirty Shutdown is when an application or operating system is turned off without notice.  Think of a power loss.  The server immediately loses power and is unable to conduct an orderly shutdown.  A dirty shutdown can cause data corruption.  These days, most modern operating systems and applications are able to recover from a dirty shutdown without losing data, but before they start delivering service again, they typically need to do some type of integrity check and data correction to ensure that their databases and logs are ready to deliver service.  This recovery phase is critical.  Typically the user is not able to interrupt the recovery process, but even if you can, please let the recovery process complete uninterrupted.  Skipping the recovery process after a dirty shutdown is just asking for more trouble.  This is a normal phase of a recovery process and takes place during the OS boot and/or application startup.  This recovery phase adds additional time to the restart process.

How long will it take?   That’s a good question and the exact answer is “it depends”.  During an HA event, your redundant node(s) can restart multiple workloads simultaneously.  Typically it takes a lot less than 15 minutes to bring your workloads back online.  During the planning phase when we talk about planning for an HA event, I use 15 minutes as the outage window.  I use 15 minutes because some workloads will take longer than others to come back up.  If a workload is available in less than 15 minutes, you look like a hero!  I’d caution you against committing to less than a 15-minute outage, especially for workloads like Exchange or SQL.  When these workloads suffer a dirty shutdown, they will examine the integrity of their databases before they bring their services online.  These integrity checks take time, but current versions of SQL and Exchange can usually be back online in under 15 minutes.
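If you want to sanity-check your own numbers against that 15 minute planning window, a back-of-the-envelope estimate might look like the sketch below.  Everything here is an assumption for illustration: the function names are made up, the failure detection delay is a term I added myself, and the restart times are placeholder values, not measurements.  The one idea taken from above is that workloads restart simultaneously, so the slowest workload sets the length of the outage rather than the sum of all of them.

```python
from typing import Dict

PLANNING_WINDOW_MIN = 15.0  # the outage window used for planning in this post


def estimated_outage_min(detect_min: float, restart_min: Dict[str, float]) -> float:
    """Estimate outage length in minutes.

    detect_min:  assumed time for the cluster to notice the failed host
    restart_min: assumed boot + integrity-check time per workload
    Workloads restart in parallel, so the slowest one dominates.
    """
    return detect_min + max(restart_min.values(), default=0.0)


def fits_planning_window(detect_min: float, restart_min: Dict[str, float]) -> bool:
    """True if the estimate stays within the 15 minute planning window."""
    return estimated_outage_min(detect_min, restart_min) <= PLANNING_WINDOW_MIN


# Placeholder numbers only; measure your own workloads before committing to anything.
workloads = {"web": 3.0, "sql": 9.0}
print(estimated_outage_min(1.0, workloads), fits_planning_window(1.0, workloads))
```

Swap in restart times you have actually observed after a dirty shutdown in a test environment, since that is where the integrity checks show up.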

  • How long does Live Migration take?

The time a Live Migration takes varies due to the size of the workload being migrated, the utilization of the workload, host server, destination server, and network I/O.  There are other factors, but these are the biggest items to consider.  Live Migrating a workload is a higher priority task, but if your machines and/or network are too taxed, Live Migration will take longer to complete.  The good news is that the length of the Live Migration process does not impact the availability of your workload.  Your workload will be delivering service to your users throughout the whole Live Migration process.

  • How long does it take to evacuate all of the workloads from a host?

It depends!  Only one Live Migration can take place at a time on a source or destination server.  If we have a four node cluster, Nodes A, B, C, & D, we could have a Live Migration occur between hosts A&B and another between hosts C&D, but we cannot have more than one Live Migration “event” occur on each host at a time.  System Center Virtual Machine Manager (SCVMM) can queue up Live Migrations so they occur in the order you specify, or you can have SCVMM evacuate all of the workloads from a node of your cluster.
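To see why evacuating a host takes a while, here is a rough Python sketch of the one-migration-per-host rule.  It groups a list of requested moves into “rounds” where no host appears more than once.  The `schedule_rounds` function and its first-fit strategy are my own illustration, not how SCVMM actually queues migrations, and first-fit may slot a later move into an earlier round than the order you listed.

```python
from typing import List, Set, Tuple

Migration = Tuple[str, str, str]  # (vm, source_host, destination_host)


def schedule_rounds(migrations: List[Migration]) -> List[List[Migration]]:
    """Group migrations into rounds so each host takes part in at most one per round."""
    rounds: List[List[Migration]] = []
    busy: List[Set[str]] = []  # hosts already involved in each round
    for vm, src, dst in migrations:
        for i, hosts in enumerate(busy):
            if src not in hosts and dst not in hosts:
                rounds[i].append((vm, src, dst))
                hosts.update((src, dst))
                break
        else:
            # Every existing round already uses one of these hosts.
            rounds.append([(vm, src, dst)])
            busy.append({src, dst})
    return rounds


# Evacuating host A (three VMs) while one unrelated move runs between C and D.
plan = schedule_rounds([
    ("vm1", "A", "B"),
    ("vm2", "C", "D"),
    ("vm3", "A", "C"),
    ("vm4", "A", "D"),
])
for number, moves in enumerate(plan, start=1):
    print(number, moves)
```

In this example the A to B and C to D moves share a round, but the three moves off host A still need three rounds, because host A can only be the source of one Live Migration at a time.  That is why the time to evacuate a host grows with the number of workloads on it.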

My intent was to create a short post this time around, but you know how it goes.  I hope this information has helped and please feel free to shoot me any questions or comments.

Until next time,

Rob
