Everything you wanted to know about SR-IOV in Hyper-V. Part 1

Now that the veil of silence has been lifted with the release of Windows Server “8” beta, I’m able to tell you a little more about what I’ve been working on for 5 years now. An inside joke is that it’s a feature with two checkboxes in the UI of Hyper-V Manager, so about 2½ years per checkbox! Well, clearly it’s a little more than that. A lot more, actually!

SR-IOV stands for Single-Root Input/Output (I/O) Virtualization. It is a standard defined by the PCI Special Interest Group. If you work for one of the member companies who have access, and are after some light bedtime reading, the specs are available on their website.

To keep it at a high level (as in all you probably need to know to use the feature), SR-IOV is a standard which describes how a PCI Express device can be constructed so that it works well with modern virtualization solutions. You may be wondering what “work well” means when Hyper-V in both Windows Server 2008 and Windows Server 2008 R2 already has a great device I/O sharing model. Good question. Before I answer that, let’s take a diversionary look at a diagram many of you will have seen variations of dozens of times before.

Hyper-V Simple Architecture

In the above diagram (I’m using networking for my example, but the same principles apply to storage), the physical device is “owned” by the parent partition. The parent is the arbiter for all traffic originating from VMs to the outside world and vice versa. The parent is also responsible for all policy decisions regarding how the device behaves, such as link speed in the case of a networking device.

Emulated versus Software Devices
Virtual machines “see” either emulated devices (such as the Intel/DEC 21140) or software-based devices (commonly referred to as either “synthetic” devices, a term I personally try to avoid, or “paravirtualised” devices), which are designed to work well in a virtualised environment. In both cases, these aren’t “real” devices physically present in hardware you can touch. In fact, a software-based device is completely fabricated: you can’t go to your local store and buy one, as it doesn’t exist in the physical world. Software-based devices take advantage of our high-speed inter-partition communication mechanism, VMBus, to efficiently pass data between the parent partition and a virtual machine. Software-based devices are far more efficient than emulated devices for four main reasons:

  • First, we don’t (generally) need the Hypervisor to be involved in the hot-path for transfer of data. An emulated device requires many Hypervisor intercepts for every single I/O, making it very expensive from a performance perspective. This is why all of our supported operating systems also include drivers for software devices.
  • Second, VMBus uses shared memory buffers and memory descriptors to move data between the parent partition and the virtual machine. Shared memory access across partitions is extremely fast, especially as system architecture and hardware technologies have improved over recent years.
  • Third, we don’t need to literally emulate a physical device in user mode in the worker process, instruction by instruction, as we do for emulated devices. Software devices do not require emulation.
  • Fourth, VMBus interfaces (for at least networking and storage) run in kernel mode in both the VM and the parent partition. We don’t need to transition up to user mode in the parent partition to complete I/Os, which would consume additional compute cycles.
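
To make the second point a little more concrete, here is a toy Python sketch of the general pattern of passing data through shared memory with descriptors: a producer (playing the VM) places a payload in a shared buffer and publishes a small descriptor saying where it sits, and a consumer (playing the parent partition) follows the descriptor to read it. The names `SharedRing` and `Descriptor` are hypothetical, and this is not the actual VMBus ring-buffer format or API. A real cross-partition implementation would also need memory barriers and signalling between the two sides, which are omitted here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Descriptor:
    """Hypothetical descriptor: where one payload sits in the shared buffer."""
    offset: int
    length: int


class SharedRing:
    """Toy single-producer/single-consumer descriptor ring.

    Illustrative only: this is NOT the VMBus ring-buffer layout or API.
    `buffer` stands in for memory mapped into both partitions.
    """

    def __init__(self, slots: int, buffer_size: int) -> None:
        self.buffer = bytearray(buffer_size)  # the "shared" data area
        self.slots: List[Optional[Descriptor]] = [None] * slots
        self.head = 0       # total descriptors posted by the producer (VM side)
        self.tail = 0       # total descriptors consumed (parent side)
        self.next_free = 0  # bump allocator into buffer; space is never reclaimed here

    def post(self, payload: bytes) -> bool:
        """Producer: write payload into the shared buffer, then publish a descriptor."""
        if self.head - self.tail == len(self.slots):
            return False  # every slot holds an unconsumed descriptor
        if self.next_free + len(payload) > len(self.buffer):
            return False  # the toy allocator has run out of space
        offset = self.next_free
        self.buffer[offset:offset + len(payload)] = payload
        self.next_free += len(payload)
        self.slots[self.head % len(self.slots)] = Descriptor(offset, len(payload))
        self.head += 1  # advancing head is what makes the descriptor visible to the consumer
        return True

    def consume(self) -> Optional[bytes]:
        """Consumer: follow the next descriptor and read its payload, or None if empty."""
        if self.tail == self.head:
            return None
        desc = self.slots[self.tail % len(self.slots)]
        self.tail += 1
        return bytes(self.buffer[desc.offset:desc.offset + desc.length])


ring = SharedRing(slots=2, buffer_size=64)
ring.post(b"frame-1")
ring.post(b"frame-2")
print(ring.consume())  # b'frame-1'
```

The point of the pattern is that only small descriptors travel through the ring, while the payload itself stays in memory both sides can already see.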

 

While software-based devices work extremely efficiently, they still add unavoidable overhead to the I/O path. For example, for security reasons, we sometimes (but not always, depending on direction and I/O class) need to copy data buffers between a virtual machine and the parent partition. No software can run with zero overhead, no matter how much fine-tuning is applied. And believe me, we spend a lot of time tuning Hyper-V for performance. Consequently, software-based devices introduce latency, increase overall path length and consume compute cycles.

Ultimately, the day will come when software alone will not be able to keep up with link speeds. With RDMA, we are already getting close. One very approximate figure (and deliberately so) is how much compute is required for Ethernet-based network I/O. This depends hugely on processor class and vendor driver, but today a single core could be consumed by between 5 and 7 Gb/sec of networking traffic generated by virtual machines using Windows Server 2008 R2 SP1. Furthermore, as line rates increase, with 40 Gigabit Ethernet and 100 Gigabit Ethernet already standardised, we have to look at how we can scale Hyper-V I/O effectively in a virtualised datacentre.
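
As a back-of-the-envelope illustration of the scaling point, the sketch below divides a line rate by a per-core throughput to estimate how many whole cores a purely software I/O path would consume. It assumes the 5 to 7 per-core figure is in gigabits per second and that throughput scales linearly across cores; both are simplifications, and `cores_needed` is a hypothetical helper, not anything in Hyper-V.

```python
import math


def cores_needed(line_rate_gbps: float, per_core_gbps: float) -> int:
    """Whole cores needed if each core sustains per_core_gbps (linear-scaling assumption)."""
    return math.ceil(line_rate_gbps / per_core_gbps)


for rate in (10, 40, 100):
    best = cores_needed(rate, 7.0)   # optimistic end of the per-core range
    worst = cores_needed(rate, 5.0)  # pessimistic end of the per-core range
    print(f"{rate} GbE: {best} to {worst} cores")
```

Under those assumptions, 10 GigE works out at about 2 cores, 40 GigE at 6 to 8, and 100 GigE at 15 to 20, which illustrates why the software path alone gets expensive as line rates rise.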

Now I’m not saying that SR-IOV is only useful in 40 and 100 Gigabit environments. Absolutely not! But with 10 GigE hardware rapidly being adopted, the time is right to look at alternative, more efficient mechanisms for device I/O which will continue to scale well in the future.

So, to answer the question I raised earlier: “works well” means a secure device model that, relative to software-based I/O device sharing, has lower latency, higher throughput and lower compute overhead, and that will continue to scale well in the future. SR-IOV meets all of these.

Now that you understand a little more about why Microsoft has been investing in SR-IOV and Hyper-V in Windows Server “8”, in the next part I’ll start getting into the detail.

Cheers
John.