VMware vSAN test pilots: Don't panic but there's a chance of DATA LOSS

https://www.theregister.co.uk/2013/09/30/vmware_warns_of_possible_vsan_beta_data_loss/

 

VMware's vSAN isn't just giving storage appliance vendors a lot to worry about: it's also giving users plenty to consider because it erases data under some circumstances. Panic not, gentle readers: the tool is in beta and Virtzilla's engineers are onto the problem. But the problem is out there, and VMware has been kind enough to detail it in an article of its own.

News of the data loss problem is well below the fold in that article, after a discussion of why it is desirable to enable pass-through mode on RAID controllers when running up a vSAN. That explanation, excerpted below, offers some interesting insights into vSAN's plumbing:

VSAN uses magnetic disks as the persistent store for the data on the VSAN datastore and Flash as a performance acceleration layer – a read cache and write buffer – in front of the magnetic disks. All writes go to the flash layer, and all reads are first tried from the flash layer. This design obtains the lowest $/GB (using magnetic disks) and the lowest $/IOP (using Flash). While magnetic disk drives provides a low $/GB, they only support limited IOPs.

VSAN directly manages the magnetic disks, published via the pass-thru controller, in a way that the limited IOPs on the magnetic disk are used in the most optimal way. To do so, VSAN implements a proximal IO algorithm. The proximal IO algorithm is used to de-stage writes from the Flash device that is “approximately” close to each other on the magnetic disk. This design addresses the “I/O blender” situation where sequential I/O from a VM can become random when multiple VM are doing I/O to the same disk. The VSAN proximal IO algorithm turns the random I/O from the I/O blender back into sequential I/O, thus improving performance.
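
As a rough illustration of the excerpt above, here is a minimal Python sketch of a flash write buffer that de-stages writes to a magnetic disk in runs of nearby block addresses. It is not VMware's code: the class `FlashBufferedDisk`, the function `group_proximal` and the `window` parameter are hypothetical names invented for this example, and the excerpt does not describe VSAN's real proximal IO algorithm in enough detail to reproduce it.

```python
def group_proximal(lbas, window):
    """Group block addresses into runs whose neighbours lie within `window` blocks.

    Hypothetical helper: a stand-in for "approximately close" on the disk.
    """
    runs = []
    for lba in sorted(lbas):
        if runs and lba - runs[-1][-1] <= window:
            runs[-1].append(lba)   # close enough: extend the current run
        else:
            runs.append([lba])     # too far away: start a new run
    return runs


class FlashBufferedDisk:
    """Toy model: flash in front of a magnetic disk (all names are assumptions)."""

    def __init__(self, window=8):
        self.window = window
        self.flash = {}          # block address -> data, the write buffer
        self.disk = {}           # block address -> data, the persistent store
        self.destage_order = []  # order in which blocks reached the disk

    def write(self, lba, data):
        # All writes land in flash first.
        self.flash[lba] = data

    def read(self, lba):
        # Reads are tried from flash first, then fall back to the disk.
        if lba in self.flash:
            return self.flash[lba]
        return self.disk.get(lba)

    def destage(self):
        # Flush buffered writes run by run, so the disk sees near-sequential I/O.
        for run in group_proximal(self.flash, self.window):
            for lba in run:
                self.disk[lba] = self.flash[lba]
                self.destage_order.append(lba)
        self.flash.clear()


# Two VMs writing sequentially, interleaved by the "I/O blender".
store = FlashBufferedDisk(window=8)
for lba in [100, 5000, 101, 5001, 102, 5002]:
    store.write(lba, f"block-{lba}")
store.destage()
print(store.destage_order)
```

In this toy model the interleaved writes from the two hypothetical VMs reach the disk as two sequential runs rather than as six scattered seeks, which is the effect the excerpt attributes to proximal IO.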

All of which is very interesting, but what you really want to know about is the data loss, lest you mess things up during some beta testing.

The good news is that data loss problems will only strike users of the Advanced Host Controller Interface (AHCI) SATA controller, which VMware says “has known issues with VSAN.”

“This manifests itself as disks/controller going into degraded mode and resulting in PDL (Permanent Device Loss),” VMware writes. “This could result in data loss and VSAN becoming unavailable.”

That's VMware's bold tagging, by the way.

The other good news is that “The VSAN team is actively looking at this issue”, which probably means lots of chats with Intel, the source of AHCI. Virtzilla has also signed off on a handful of other RAID controllers from IBM, HP, Dell and LSI, so it's not as if the AHCI problem means vSAN is too dangerous to touch in beta form. ®