The Case of the Pending VM Snapshot Merge


Guest post from reader Jeremy Hagan

Recently I had a virtual machine stop responding. Upon investigation I noticed that the machine was paused. This usually happens when the underlying disk has run out of space, and that was the case here as well. I thought this was unusual, since I have followed the Microsoft recommendations in my production Hyper-V environment by having one LUN per VM and using fixed-size VHD files, so there was nothing that should grow to fill up the disk. Digging further, I found that the machine was actually running on a differencing disk (AVHD), not a fixed-size VHD, and that this differencing disk was there because a snapshot taken in the past had been deleted but was still waiting to merge into the parent VHD file.

This raises an interesting issue. When a snapshot is deleted, the virtual machine must be powered off, not merely rebooted, for the data in the differencing disk to merge into the parent VHD, and it must remain powered off until the merge is complete. If the machine is booted again before the merge has completed, the merge stops until the machine is next powered off. The bigger the AVHD, the longer the merge takes. And while SCVMM and Hyper-V Manager indicate in the GUI that a snapshot exists, a deleted snapshot disappears from the GUI straight away even though its AVHD hangs around until the merge is complete. So if there is a VM in this state, there is no visible indication of it.
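
Since the GUI gives no hint, one low-tech check is to look for AVHD files on the storage itself. The following is a minimal sketch under stated assumptions: the function name Get-AvhdFile and the folder passed to it are hypothetical, chosen for illustration. An AVHD also exists for every snapshot you still want to keep, so a hit is only a prompt to investigate, not proof of a pending merge.

```powershell
# List every AVHD file under a folder together with its size.
# Get-AvhdFile is a hypothetical helper name, not part of any Hyper-V tooling.
function Get-AvhdFile {
    param(
        [Parameter(Mandatory = $true)]
        [string]$Path
    )

    Get-ChildItem -Path $Path -Recurse |
        # Keep files only, and match on the extension (case-insensitive)
        Where-Object { (-not $_.PSIsContainer) -and ($_.Extension -eq '.avhd') } |
        Select-Object FullName,
            @{ Name = 'SizeGB'; Expression = { [math]::Round($_.Length / 1GB, 2) } }
}
```

Calling it with the folder that holds a VM's disks returns one object per AVHD, so an unexpectedly large SizeGB on a VM that is supposed to have no snapshots is worth a closer look.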

Back to the problem at hand. It appears that the machine had been running on the AVHD of a deleted snapshot for an unknown period and that the AVHD had grown to the point where it filled the underlying disk. Since quite a few of our LUNs that run VMs are full to within a few percent, we have disabled the disk space alerts on all LUNs that host VMs, so we were never alerted to the impending issue. The problem was exacerbated by the fact that the LUN was now so full that the merge could not take place even if the VM was turned off.

Luckily we maintain a 250 GB LUN on each Hyper-V cluster that is used for staging and emergencies, so it was an easy thing to power the machine off and use System Center Virtual Machine Manager (SCVMM) to migrate the VM over the network to this larger LUN and allow the merge to complete and then move it back to its home. Problem solved.

Preventing a repeat occurrence

Obviously our strategy of disabling free disk space monitoring on LUNs hosting VMs had bitten us. I needed a strategy to prevent a repeat occurrence of this issue. I could always re-enable the free-space monitoring of the LUNs, but this seemed counter-productive. Since the LUNs routinely run low on space, SCOM would routinely raise Chicken Little alerts that I would learn to ignore, and I would then miss the one alert raised when the sky really was falling. Clearly a different approach was required, and what I really needed was a way to alert on a pending merge.

After a lot of Internet research, and trawling through the output of WMI objects and the registry for a suitable indication, I came up with nothing. So I decided I needed to examine the difference between virtual machines in three different states, namely:

  • one without a snapshot
  • one with a snapshot
  • one with a deleted snapshot

So I started by downloading the PowerShell Management Library for Hyper-V and trawling through the various settings on snapshots that were available to me. I thought I was onto something when I saw that a VM with a pending merge still had the disk listed as the AVHD file, not the parent VHD. I managed to write a complicated script that went something like this:

  1. Count the number of snapshots
  2. Recursively check the parent of the AVHD file(s) until you find the parent is a VHD
  3. Compare the number of links to the number of snapshots
  4. If there is a mismatch then there is a pending merge
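
The steps above can be sketched roughly as follows. This is an illustration of the idea only, not the script I actually wrote: the file names, the $parentOf lookup table and both function names are hypothetical, and a real version would have to read the parent links and the snapshot count from Hyper-V itself.

```powershell
# Hypothetical data: each differencing disk mapped to its parent file
$parentOf = @{
    'vm1_A1.avhd' = 'vm1_A0.avhd'
    'vm1_A0.avhd' = 'vm1.vhd'
}

# Count the AVHD links between the disk the VM is running on and its base VHD
function Get-ChainLength {
    param([string]$Disk, [hashtable]$ParentOf)
    $links = 0
    # Follow parent links until the current file is no longer an AVHD
    while ($Disk -like '*.avhd') {
        $links++
        $Disk = $ParentOf[$Disk]
    }
    return $links
}

# A mismatch between chain length and snapshot count suggests a pending merge
function Test-PendingMergeHeuristic {
    param([string]$CurrentDisk, [int]$SnapshotCount, [hashtable]$ParentOf)
    $chainLength = Get-ChainLength -Disk $CurrentDisk -ParentOf $ParentOf
    return ($chainLength -ne $SnapshotCount)
}

# Hypothetical case: two AVHDs in the chain, but only one snapshot still listed
Test-PendingMergeHeuristic -CurrentDisk 'vm1_A1.avhd' -SnapshotCount 1 -ParentOf $parentOf
```

The sketch handles a single disk only, which is exactly where the real thing started to get awkward.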

Now, this worked in the majority of contrived cases, but it all fell down if a disk had been added after a snapshot was taken, and it was difficult to code when there were multiple disks. Finally I decided to go diving into the configuration XML file. I had been hesitant to fiddle too much with the Hyper-V configuration files: they might be simple XML, but it is not like there are a lot of technical articles out there that advocate editing them. At this point, though, I was out of other ideas.

After a bit of poking and fiddling in a test machine’s configuration file I eventually came across the mother lode: configuration/global_settings/disk_merge_pending = true. This is the point where I feel kind of stupid spending the amount of time I had already spent up to this point when the solution ended up being so simple. But enough wallowing in self-pity, the problem was not solved yet.
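
To make that concrete, here is a minimal sketch of reading the flag with an XPath query. The XML is a hypothetical, heavily trimmed stand-in for a real configuration file: the element paths are the ones named above, but the type attributes, the VM name and the function name Test-DiskMergePending are assumptions for illustration.

```powershell
# Hypothetical, heavily trimmed configuration; a real file has many more elements.
# The type attributes are an assumption about the file's layout.
[xml]$config = @'
<configuration>
  <global_settings>
    <disk_merge_pending type="bool">True</disk_merge_pending>
  </global_settings>
  <properties>
    <name type="string">TestVM</name>
  </properties>
</configuration>
'@

function Test-DiskMergePending {
    param(
        [Parameter(Mandatory = $true)]
        [xml]$Config
    )
    $node = $Config.SelectSingleNode('/configuration/global_settings/disk_merge_pending')
    # A missing element is treated as "no merge pending"
    $pending = ($null -ne $node) -and ($node.InnerText -eq 'True')
    return $pending
}

$vmName = $config.SelectSingleNode('/configuration/properties/name').InnerText
if (Test-DiskMergePending -Config $config) {
    Write-Output "$vmName has a pending merge."
}
```

Note that XPath is case-sensitive, whereas PowerShell's dotted property access on XML is not.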

Implementation

On to the implementation part of the story. You’d think it would be simple now, but there were more issues to cover. I intended to monitor this with SCOM and this introduced a couple of complications. Both PowerShell and VBScript offer native support for parsing XML. Doing it in PowerShell is dead easy and doing it in VBScript is more complicated and unintuitive for a poor sysadmin such as myself. The problem is, support for embedding a script into a SCOM monitor or rule is restricted to VBScript.

Let me take a short time to digress and discuss my implementation woes with PowerShell. By default PowerShell will not run scripts at all (the Restricted execution policy). To run scripts, there are three settings available to you:

  • Unrestricted: Forget about signing and just run any script
  • RemoteSigned: Require signing on any script that is considered remote. What is considered remote? This post from the PowerShell team will educate you.
  • AllSigned: Require signing on all scripts

From the blog post on “what is considered remote” you will find that if you have Internet Explorer Enhanced Security Configuration (IEESC) turned on, the Intranet zone is considered remote, so any script run from a network share is considered remote. I would have been happy to create my own code-signing certificate, but I found that unless you have an Enterprise CA running on Windows Server Enterprise Edition, you can’t create your own code-signing certificate with your Microsoft certificate authority. I was stuck. My choices, none of them palatable, were:

  • Buy a commercial code signing certificate.
  • Turn off IEESC and run the risks of lazy sysadmins browsing the Internet from the server.
  • Distribute all scripts to local disk and run them from there and have to manage the distribution of scripts and their updates.
  • Configure the PowerShell script-signing policy to Unrestricted and reduce the security posture of my network.

I’ll leave this side issue here since I have explained all the relevant background and get back to the issue of implementing my SCOM monitoring of pending merges.

So, I was left with my conundrum about which scripting language to implement my SCOM monitor in. VBScript allowed me to run the script straight out of SCOM and provides a nice way of returning output from the script to be tested by the monitor and reported in the alert. PowerShell offered the path of least resistance in writing the script. Then I had a brainwave. Why not have the best of both worlds? Here is what I came up with:

  • Use VBScript
  • Have the VBScript create a temporary PowerShell script in the current directory. Because the VBScript is distributed by SCOM it will be on local disk, solving my script-signing problem. I can now use RemoteSigned and keep my servers safe.
  • Execute the PowerShell script by using the Windows Script Host Exec method, which allows you to capture the output through the StdOut property of the object that Exec returns
  • Return the state of the Hyper-V host (whether or not there was a pending merge) and the name of the VM with the pending merge to the monitor via the property bag API
  • Set the machine to a Warning state and raise an alert including the output of the script

This setup worked great! I was getting an alert and a Warning state when a pending merge was detected, and when the pending merge was resolved the alert would auto-close and the monitored Hyper-V host would return to a Healthy state. Unfortunately it was still not good enough. I could imagine a case where an alert is raised but, in the time it takes to arrange an outage for the VM so the merge can complete, another machine has its snapshot deleted and also ends up with a pending merge. The first alert would indicate that a certain host and a certain VM had a pending merge, but no new alert would be raised for the second pending merge. And when you resolved the first pending merge, the second one would prevent the alert from closing.

My solution was to create a rule to run the script and a monitor to alert on the results. I modified my script and abandoned the property bag and just wrote the information to the local Application event log and the monitor would raise the alert based on that information. I configured the rule to run every 24 hours and for the alert to expire after 23 hours, so SCOM basically bugs me every day about the pending merge until I resolve it. Of course you could customise this in your own implementation.

PowerShell

# Check every Hyper-V VM configuration file on this host for a pending merge
$scriptDir = Split-Path -Parent $MyInvocation.MyCommand.Definition
foreach ($file in Get-ChildItem "$Env:PROGRAMDATA\Microsoft\Windows\Hyper-V\Virtual Machines\*.xml") {
    # Work on a copy in the script's own folder so the live file is never touched
    $tempCopy = Join-Path $scriptDir $file.Name
    $file.CopyTo($tempCopy, $true) | Out-Null
    [xml]$ConfigFile = Get-Content $tempCopy
    $VMName = $ConfigFile.configuration.properties.name."#text"
    $MergePending = $ConfigFile.configuration.global_settings.disk_merge_pending."#text"
    if ($MergePending -eq "True") { Write-Host "$VMName has a pending merge." }
    # Clean up the copy and reset the variables before the next file
    Remove-Item $tempCopy
    Remove-Variable VMName, ConfigFile, MergePending
}

I decided to copy the configuration files to a temporary location before parsing them just in case Hyper-V didn’t cope well with manipulating them.

VBScript with embedded PowerShell

Option Explicit

Dim objFSO : Set objFSO = CreateObject("Scripting.FileSystemObject")
Dim objShell : Set objShell = CreateObject("WScript.Shell")
Dim strParentPath : strParentPath = objFSO.GetParentFolderName(WScript.ScriptFullName)
Dim objPSFile : Set objPSFile = objFSO.CreateTextFile(strParentPath & "\PSTemp.ps1", True)

'Get the path to the PowerShell executable from the registry
Dim strPowerShell : strPowerShell = objShell.RegRead("HKLM\SOFTWARE\Microsoft\PowerShell\1\ShellIds\Microsoft.PowerShell\Path")
Dim strCommand : strCommand = Chr(34) & strPowerShell & Chr(34) & " -NoProfile -NoLogo -File " & Chr(34) & strParentPath & "\PSTemp.ps1" & Chr(34)
Dim objExec
Dim strOutput : strOutput = ""
Dim i

objPSFile.WriteLine "foreach ($file in Get-ChildItem " & Chr(34) & "$Env:PROGRAMDATA\Microsoft\Windows\Hyper-V\Virtual Machines\*.xml" & Chr(34) & ") {"
objPSFile.WriteLine " $file.CopyTo((Split-Path -parent $MyInvocation.MyCommand.Definition) + " & Chr(34) & "\" & Chr(34) & " + $file.Name) | out-null"
objPSFile.WriteLine " [xml]$ConfigFile = Get-Content ((Split-Path -parent $MyInvocation.MyCommand.Definition) + " & Chr(34) & "\" & Chr(34) & " + $file.Name)"
objPSFile.WriteLine " $VMName = $ConfigFile.Configuration.properties.name." & Chr(34) & "#text" & Chr(34)
objPSFile.WriteLine " $MergePending = $ConfigFile.configuration.global_settings.disk_merge_pending." & Chr(34) & "#text" & Chr(34)
objPSFile.WriteLine " if ($MergePending -eq " & Chr(34) & "True" & Chr(34) & ") {Write-Host " & Chr(34) & "$VMName has a pending merge." & Chr(34) & "}"
objPSFile.WriteLine " del ((Split-Path -parent $MyInvocation.MyCommand.Definition) + " & Chr(34) & "\" & Chr(34) & " + $file.Name)"
objPSFile.WriteLine " Remove-Variable VMName"
objPSFile.WriteLine " Remove-Variable ConfigFile"
objPSFile.WriteLine " Remove-Variable MergePending"
objPSFile.WriteLine "}"
objPSFile.Close

Set objExec = objShell.Exec(strCommand)
objExec.StdIn.Close

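'Sanity check: wait for PowerShell to exit, terminating it if it runs longer than about 55 seconds
i = 0
Do While objExec.Status = 0
    WScript.Sleep 1000
    i = i + 1
    If i > 55 Then
        i = 0
        objExec.Terminate
    End If
Loop

'Collect everything the PowerShell script wrote to standard output
Do While Not objExec.StdOut.AtEndOfStream
    strOutput = strOutput & objExec.StdOut.Read(1)
Loop

'Log a Warning event (type 2) to the Application event log if any VM reported a pending merge
If InStr(strOutput, "pending") > 0 Then objShell.LogEvent 2, strOutput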

objFSO.DeleteFile strParentPath & "\PSTemp.ps1", True

Set objFSO = Nothing
Set objExec = Nothing
Set objPSFile = Nothing
Set objShell = Nothing

SCOM Rule

[Three screenshots of the SCOM rule configuration; images not available]

SCOM Monitor

[Four screenshots of the SCOM monitor configuration; images not available]