Everything you wanted to know about SR-IOV in Hyper-V Part 2

In part 1, I discussed why Microsoft has been investing in SR-IOV for device I/O from virtual machines. The key points were reducing latency, increasing throughput, lowering compute overhead, and providing for future scalability. Part 2 takes a look at SR-IOV from a hardware perspective.

For those who didn’t take up my offer of the light bedtime reading of the SR-IOV specs (or don’t have access), let me summarize what SR-IOV is. And to be clear, when I say “SR-IOV”, take it to include the closely associated specifications or additions to the PCI Express specifications such as:

  • ATS (Address Translation Services) – A PCI Express protocol that allows a device to fetch address translations from an IOMMU (Input/Output Memory Management Unit).
  • ARI (Alternative Routing-ID Interpretation) – A PCI Express switch and device change that allows a single device to expose more than eight functions (routing IDs) on a bus, by reinterpreting the device number bits as part of the function number. There’s a short sketch of this after the list.
  • ACS (Access Control Services) – A PCI Express switch feature that forces peer-to-peer traffic upstream so that it can be translated by an IOMMU.
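
To make ARI a little more concrete, here is a minimal sketch in C (purely illustrative, not taken from any driver or spec text) of how the same 16-bit routing ID is interpreted with and without ARI:

#include <stdio.h>
#include <stdint.h>

/* Sketch: without ARI a routing ID is bus(8):device(5):function(3), so a
 * device is limited to eight functions. With ARI the device and function
 * fields merge into a single 8-bit function number (0-255).
 */
static void decode_rid(uint16_t rid, int ari_enabled)
{
    uint8_t bus = rid >> 8;
    if (ari_enabled)
        printf("bus %u, function %u (ARI)\n", bus, rid & 0xFF);
    else
        printf("bus %u, device %u, function %u\n",
               bus, (rid >> 3) & 0x1F, rid & 0x07);
}

int main(void)
{
    decode_rid(0x0483, 0);   /* bus 4, device 16, function 3 */
    decode_rid(0x0483, 1);   /* bus 4, function 131 - not expressible without ARI */
    return 0;
}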

First I’m going to say what SR-IOV isn’t. Most importantly, it doesn’t refer to any particular I/O class; there isn’t a single mention of networking, storage or any other I/O class in the specs. It doesn’t describe how software should be designed to use SR-IOV capable hardware. It’s just about hardware. Consequently, it doesn’t describe a driver model (which is essential for a complete solution), nor the hardware-specific nuances relevant to any particular I/O class (one of which I mention at the end of this post).

The SR-IOV specs do however describe how a hardware device can expose multiple “light-weight” hardware surfaces for use by virtual machines. These are called Virtual Functions, or VFs for short. VFs are associated with a Physical Function (PF). The PF is what the parent partition uses in Hyper-V, and it is equivalent to the regular BDF (Bus/Device/Function) addressed PCI device you may have heard of before. The PF is responsible for arbitration relating to policy decisions (such as link speed, or the MAC addresses in use by VMs in the case of networking) and for I/O from the parent partition itself. Although a VF could be used by the parent partition, in Windows Server “8” VFs are only used by virtual machines. A single PCI Express device can expose multiple PFs (such as a multi-port networking device), each generally independent, with its own set of VF resources. There are subtleties on multi-function devices, such as ones which support, for example, iSCSI, Ethernet and FCoE, but that is beyond the scope of this series of posts, and the approach differs between hardware vendors.
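
As an aside, the way each VF gets its own routing ID is simple arithmetic over fields in the PF’s SR-IOV capability (First VF Offset and VF Stride). A minimal sketch, using made-up example values:

#include <stdio.h>
#include <stdint.h>

/* Per the SR-IOV spec, VF n (1-based) has:
 *   RID = PF RID + First VF Offset + (n - 1) * VF Stride
 * The PF RID, offset and stride values below are hypothetical.
 */
static uint16_t vf_rid(uint16_t pf_rid, uint16_t first_vf_offset,
                       uint16_t vf_stride, unsigned n)
{
    return (uint16_t)(pf_rid + first_vf_offset + (n - 1) * vf_stride);
}

int main(void)
{
    uint16_t pf = 0x0400;              /* hypothetical PF: bus 4, device 0, function 0 */
    for (unsigned n = 1; n <= 4; n++)
        printf("VF%u routing ID = 0x%04x\n", n, vf_rid(pf, 0x0080, 0x0002, n));
    return 0;
}

Notice that the resulting routing IDs quickly imply function numbers above seven, which is one reason ARI (mentioned above) matters for devices exposing many VFs.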

It is important to note that VFs are hardware resources, and as such there are constraints on the number of VFs available on any particular device. The actual number will differ across vendors and devices, and you can expect that as hardware moves through newer revisions, the trend will be to offer more VFs per PF. Typically we are seeing 16, 32 or 64 VFs per PF in first-generation 10 GigE SR-IOV enabled networking hardware.

VFs alone aren’t sufficient to securely allow a VM direct access to hardware. Traditional PCI Express devices generally “talk” in system physical address space (SPA) terms. As you may be aware, we don’t run guest operating systems in SPA; we run them in guest physical address space, or GPA. So there has to be something which translates (and ideally caches) addresses for DMA transfers. This is DMA remapping. In addition, for security reasons, we require hardware-assisted interrupt remapping. For those who want to learn more about the hardware side of this, see this page about VT-d for Intel, or this page about AMD-V for AMD. There are plenty of specs to read for the inquisitive reader!
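
To illustrate the concept (and only the concept – real IOMMUs use multi-level page tables in hardware), here is a toy C sketch of the GPA-to-SPA lookup that DMA remapping performs on each transfer from a VF. The table contents and page granularity are invented for illustration:

#include <stdio.h>
#include <stdint.h>

#define PAGE_SIZE 4096u

struct mapping { uint64_t gpa_page; uint64_t spa_page; };

/* Hypothetical translation table set up by the hypervisor for one VF. */
static const struct mapping table[] = {
    { 0x00010, 0x7A300 },   /* GPA page 0x10 -> SPA page 0x7A300 */
    { 0x00011, 0x7A301 },
};

static int translate(uint64_t gpa, uint64_t *spa)
{
    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++) {
        if (table[i].gpa_page == gpa / PAGE_SIZE) {
            *spa = table[i].spa_page * PAGE_SIZE + gpa % PAGE_SIZE;
            return 1;       /* translation found */
        }
    }
    return 0;               /* no mapping: the DMA is blocked rather than allowed through */
}

int main(void)
{
    uint64_t spa;
    if (translate(0x10ABCull, &spa))
        printf("GPA 0x10ABC -> SPA 0x%llx\n", (unsigned long long)spa);
    return 0;
}

The important property is the failure case: a DMA to a GPA the hypervisor never mapped for that device simply isn’t translated, rather than scribbling over arbitrary system memory.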
 
From this point on in this series, I’m going to generically use the term IOMMU to refer to hardware capabilities which provide interrupt and DMA remapping.

To be clear, as I haven’t said it explicitly otherwise, an SR-IOV device with a suitable driver can be used as a regular I/O device outside of virtualisation; it just won’t take advantage of the additional hardware capabilities without virtualisation present. Further, the device does not require the presence of “IOMMU” hardware to be used in this manner.

Although I’ve referred to networking a few times so far, I’ve also said that SR-IOV from a spec standpoint doesn’t mention anything about an I/O class. When we (the Hyper-V team) looked at where the biggest gains were in using SR-IOV, it was clear to us that the overhead of storage I/O was significantly less than that of networking I/O. Hence for Windows Server “8”, networking is the only device class for which we have worked on SR-IOV support.

Although it may not be immediately obvious, for a NIC vendor to create an SR-IOV capable PCI Express device, it’s not sufficient to follow the SR-IOV specifications alone. One reason is that the NIC has to move network traffic between multiple sources (the PF and the VFs), as well as on and off the wire. To enable Ethernet frames to be routed between two VFs, for example, most parts of an Ethernet switch have to be embedded onto the physical NIC. None of this is present in the SR-IOV specifications.
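
Conceptually, the forwarding decision that embedded switch makes looks something like this toy sketch. The MAC addresses and the simple exact-match table are invented for illustration; real hardware also has to deal with broadcast, multicast, VLANs and much more:

#include <stdio.h>
#include <string.h>
#include <stdint.h>

struct port_entry { uint8_t mac[6]; const char *port; };

/* Hypothetical table: which MAC address lives behind which function. */
static const struct port_entry fwd_table[] = {
    { { 0x00, 0x15, 0x5D, 0x01, 0x02, 0x03 }, "VF0 (VM 1)" },
    { { 0x00, 0x15, 0x5D, 0x01, 0x02, 0x04 }, "VF1 (VM 2)" },
    { { 0x00, 0x15, 0x5D, 0x01, 0x02, 0x05 }, "PF (parent partition)" },
};

static const char *forward(const uint8_t dst[6])
{
    for (size_t i = 0; i < sizeof fwd_table / sizeof fwd_table[0]; i++)
        if (memcmp(fwd_table[i].mac, dst, 6) == 0)
            return fwd_table[i].port;
    return "external wire";  /* unknown unicast goes out the physical port */
}

int main(void)
{
    const uint8_t to_vm2[6] = { 0x00, 0x15, 0x5D, 0x01, 0x02, 0x04 };
    printf("frame delivered to: %s\n", forward(to_vm2));
    return 0;
}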

So now we’ve covered SR-IOV from the “why” perspective in part 1, and the hardware perspective in this part. Part 3 will take a look at the software perspective of supporting SR-IOV in Windows Server “8”.

Cheers
John.