VFIO Device Assignment Quirks

How to Use Them and How to Avoid Them

 

Alex Williamson / alex.williamson@redhat.com

A quick VFIO refresher

VFIO: A userspace driver interface

Devices are decomposed into a userspace API

QEMU consumes the VFIO API

Recomposing the physical device to a virtual device

For further details:

KVM Forum 2016: "An Introduction to PCI Device Assignment with VFIO"

Quirks

What are Quirks


Quirks are software bandages that account for missing or broken device or topology features, or that implement additional device virtualization

Where can we use Quirks?


Anywhere, but our goal is to assign the device and let it run without further fast path interaction

Where do we use Quirks?

Examples

Disclaimer

Hardware mistakes and oversights happen. It's fun to pick at vendors, but in many cases quirks represent cases where hardware vendors have worked with us to “correct” hardware behavior in software. In some cases the corrections are evident in later generations of hardware. This should be encouraged.

Isolation

  • Can untranslated DMA reach other devices?
  • Access Control Services (ACS) enforces packet routing within PCIe topology
  • Without ACS, peer-to-peer redirection must be assumed
    • IOMMU group size increases
    • Assignment granularity decreases
  • Vendors can specify ACS equivalent routing
    • Bandage the hardware in software
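
The effect of missing ACS shows up directly in the IOMMU group layout. As a minimal sketch, assuming a Linux host that exposes groups under /sys/kernel/iommu_groups, the following lists each group and flags those holding more than one device. The function names are hypothetical, chosen for illustration only.

```python
from pathlib import Path


def list_iommu_groups(root="/sys/kernel/iommu_groups"):
    """Map each IOMMU group number to the sorted device addresses it contains."""
    groups = {}
    root_path = Path(root)
    if not root_path.is_dir():
        # No IOMMU enabled, or the path does not exist on this system.
        return groups
    for group_dir in root_path.iterdir():
        devices_dir = group_dir / "devices"
        if not group_dir.name.isdigit() or not devices_dir.is_dir():
            continue
        groups[int(group_dir.name)] = sorted(p.name for p in devices_dir.iterdir())
    return groups


def shared_groups(groups):
    """Return only the groups that contain more than one device."""
    return {group: devs for group, devs in groups.items() if len(devs) > 1}
```

Calling `shared_groups(list_iommu_groups())` on a host gives the groups where assignment granularity is coarser than a single device. Whether a particular multi-device group is caused by missing ACS or by something else (for example a conventional PCI bridge) has to be checked against the topology.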

Isolation: Examples

  • Intel PCH PCIe root ports
    • 5- through 9-series chipsets: No ACS
      • Chipset specific equivalent features
    • 100- & 200-series: Broken ACS
    • Fixed in 300-series chipsets!
  • Intel “client” processor root ports
    • No ACS, no isolation guarantees, no quirks
    • Use server processors   :-\

Isolation: More Examples

  • AMD Ryzen root ports
    • Fixed in firmware update
  • Numerous ACS endpoint quirks for many vendors
    • AMD, Ampere, Cavium, Emulex, Intel NICs, Qualcomm, Solarflare

Isolation: We're getting better

  • ACS is increasingly common on endpoint and interconnect components
  • Reminder: Implement ACS on all downstream ports and multifunction endpoints

DMA Aliases

  • Does the DMA requester match the device address?
  • If they are inconsistent, the IOMMU cannot function, whether on bare metal or for device assignment
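
To make the question concrete: a requester ID is the 16-bit bus/device/function tuple carried with each transaction, and an alias exists when that ID differs from the address of the device that issued the DMA. The sketch below assumes the conventional 8-bit bus, 5-bit device, 3-bit function layout. The helper names are hypothetical.

```python
def requester_id(bus, device, function):
    """Pack bus/device/function into a 16-bit requester ID."""
    if not (0 <= bus <= 0xFF and 0 <= device <= 0x1F and 0 <= function <= 0x7):
        raise ValueError("bus/device/function out of range")
    return (bus << 8) | (device << 3) | function


def parse_bdf(address):
    """Parse 'bb:dd.f' or 'ssss:bb:dd.f' into (bus, device, function)."""
    bus_dev, function = address.rsplit(".", 1)
    parts = bus_dev.split(":")
    # The last two colon-separated fields are bus and device; a leading
    # segment (domain) field, if present, is ignored here.
    return int(parts[-2], 16), int(parts[-1], 16), int(function, 16)


def is_dma_alias(device_address, observed_requester_id):
    """True if the observed requester ID does not match the device address."""
    return requester_id(*parse_bdf(device_address)) != observed_requester_id
```

For example, a function at 03:00.1 whose DMA arrives tagged as function 0 gives `is_dma_alias("03:00.1", 0x0300) == True`, which is the kind of mismatch a DMA alias quirk has to describe to the IOMMU code.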

DMA Aliases: Examples

  • PCIe function 0 is the requester: Ricoh
  • PCIe function 1 is the requester: Marvell
  • A different PCI slot is the requester: Adaptec
  • Hidden requesters behind non-transparent bridge


Hardware designs must consider an IOMMU and use predictable, discoverable requester IDs

Resets

  • Return device to a known state
  • Wipe on-device data between uses
  • Multiple mechanisms through PCI:
    PCIe FLR, AF FLR, PM, bus reset, hot-plug slot


Apparently still a difficult hardware feature
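
Which function-scoped reset mechanisms a device advertises can be read from its config space. The following is a minimal sketch that walks the standard capability list of a raw config space buffer (for instance the contents of /sys/bus/pci/devices/<address>/config, which may be truncated to 64 bytes for unprivileged readers) and reports PCIe FLR, AF FLR, and PM reset support. The labels returned are this sketch's own, and bus and hot-plug slot resets depend on the topology, so they are not covered. Advertised support does not mean the reset works, which is the point of the examples in this section.

```python
PCI_STATUS = 0x06
PCI_STATUS_CAP_LIST = 0x10       # Status bit 4: capability list present
PCI_CAPABILITY_LIST = 0x34       # Pointer to the first capability
PCI_CAP_ID_PM = 0x01
PCI_CAP_ID_EXP = 0x10
PCI_CAP_ID_AF = 0x13
PCI_EXP_DEVCAP_FLR = 1 << 28     # Device Capabilities: FLR capable
PCI_AF_CAP_FLR = 0x02            # AF Capabilities: FLR supported
PCI_PM_CTRL_NO_SOFT_RESET = 0x08


def find_capabilities(config):
    """Return {capability ID: offset} for the standard capability list."""
    caps = {}
    if len(config) < 0x40:
        return caps
    status = int.from_bytes(config[PCI_STATUS:PCI_STATUS + 2], "little")
    if not status & PCI_STATUS_CAP_LIST:
        return caps
    pos = config[PCI_CAPABILITY_LIST] & 0xFC
    seen = set()
    # Guard against truncated buffers and looping capability chains.
    while pos >= 0x40 and pos + 1 < len(config) and pos not in seen:
        seen.add(pos)
        caps.setdefault(config[pos], pos)
        pos = config[pos + 1] & 0xFC
    return caps


def reset_methods(config):
    """List the function-scoped reset mechanisms the device advertises."""
    caps = find_capabilities(config)
    methods = []
    pos = caps.get(PCI_CAP_ID_EXP)
    if pos is not None and pos + 8 <= len(config):
        devcap = int.from_bytes(config[pos + 4:pos + 8], "little")
        if devcap & PCI_EXP_DEVCAP_FLR:
            methods.append("flr")
    pos = caps.get(PCI_CAP_ID_AF)
    if pos is not None and pos + 4 <= len(config):
        if config[pos + 3] & PCI_AF_CAP_FLR:
            methods.append("af_flr")
    pos = caps.get(PCI_CAP_ID_PM)
    if pos is not None and pos + 6 <= len(config):
        pmcsr = int.from_bytes(config[pos + 4:pos + 6], "little")
        # A D3hot -> D0 transition only resets the function if
        # No_Soft_Reset is clear.
        if not pmcsr & PCI_PM_CTRL_NO_SOFT_RESET:
            methods.append("pm")
    return methods
```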

Reset: Examples

  • Intel NVMe: extra post-FLR delay
  • Samsung NVMe: controller disable before FLR
  • AMD Radeon: bus reset bugs
  • Atheros: devices disappear on bus reset
  • Threadripper root ports fail after bus reset
    • Fixed in firmware!

Virtualization

  • The “asterisk” in the new QEMU device
  • All devices require some degree of virtualization
    • Address space
    • Topology

Some devices require a little extra virtualization…

Virtualization: Examples

  • VGA bootstrapping
    • See 2013 & 2014 KVM Forum talks
    • Mirrors and windows to PCI config space
  • Intel IGD GTT programming
  • Intel i40e INTx status register
  • Intel SR-IOV VF INTx pin masking
  • Realtek RTL8168 MSI-X programming
  • Chelsio T5 bogus MSI-X PBA fixup
  • General MSI-X relocation

Summary: Quirks

  • Know your quirks
  • Available throughout the device assignment stack
  • Supplement missing and broken features
  • Mask device and address space issues
  • Not needed by well-behaved devices

For Hardware Designers

  • Implement PCIe ACS per specification
    • Downstream ports & multifunction endpoints
  • Provide working function level reset
  • Avoid leaking device physical addresses
  • Avoid page size issues with separate MSI-X BAR
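
The MSI-X point can be illustrated with a little arithmetic. The hypervisor emulates the MSI-X table, so any host page that holds the table and also holds other BAR contents cannot simply be mapped through to the guest, and larger host page sizes make the overlap more likely. The sketch below assumes the standard MSI-X layout (16 bytes per table entry, Table Offset/BIR register with the BAR index in the low 3 bits) and treats anything in the BAR other than the table, including the PBA, as "other contents". The function names are hypothetical.

```python
MSIX_ENTRY_SIZE = 16  # bytes per MSI-X table entry


def decode_msix_table_register(value):
    """Split the Table Offset/BIR register into (BAR index, byte offset)."""
    return value & 0x7, value & 0xFFFFFFF8


def msix_shares_pages(bar_size, table_offset, num_vectors, page_size=4096):
    """True if a page holding the MSI-X table also holds other BAR space."""
    if not 1 <= num_vectors <= 2048:
        raise ValueError("MSI-X supports 1 to 2048 vectors")
    start = table_offset
    end = table_offset + MSIX_ENTRY_SIZE * num_vectors
    if end > bar_size:
        raise ValueError("MSI-X table does not fit in the BAR")
    # Round the table span out to page boundaries, clipped to the BAR.
    page_start = (start // page_size) * page_size
    page_end = min(-(-end // page_size) * page_size, bar_size)
    return page_start < start or end < page_end
```

A table of 8 vectors at offset 0x2000 in a 16 KiB BAR shares its 4 KiB page with whatever else lives there, while a 4 KiB BAR filled entirely by a 256-vector table does not. A BAR smaller than the host page size can still share a host page with neighbouring BARs, which this sketch does not model.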

Questions?

Alex Williamson / alex.williamson@redhat.com