Implementing PCI Device Passthrough (IOMMU) with Intel VT-d, KVM, QEMU, and libvirtd on Fedora 21

Update

See the next article addressing the topic of IOMMU-based PCI passthrough with KVM/QEMU in Fedora 22 (using an NVidia GPU, even!) for another walkthrough.

Procedure

Hardware inventory

1)  Ensure your processor supports IOMMU:

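  • AMD processors advertise AMD-V (CPU virtualization) with the “svm” flag.  As with Intel’s vmx flag below, this is necessary but not sufficient: the IOMMU feature itself is AMD-Vi, so also confirm AMD-Vi support on the product site.
$ grep svm /proc/cpuinfo
#If no output is provided, the Linux kernel does not
#believe your processor supports AMD-V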
  • Intel processors provide the vmx flag, but this only indicates VT-x support, which is insufficient on its own (we need VT-d).  The best way to determine support is checking the product site, confirming the processor supports VT-d, and moving to the next step.
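Both CPU flags can be checked in one pass.  The following is a minimal sketch, not a definitive test: the function name cpu_virt_flags is made up for illustration, and the optional file argument exists only so the function can be tried against a saved copy of /proc/cpuinfo.  Remember that neither flag proves IOMMU support.

```shell
# Report which CPU virtualization flags appear in a cpuinfo-style file.
# cpu_virt_flags is a hypothetical helper name; the file argument
# defaults to /proc/cpuinfo.
cpu_virt_flags() {
    cpuinfo="${1:-/proc/cpuinfo}"
    # -o prints only the matched word, -w matches whole words only,
    # and sort -u collapses the per-core repeats into one line each.
    grep -owE 'vmx|svm' "$cpuinfo" | sort -u
}
```

Running `cpu_virt_flags` with no argument prints vmx on a VT-x system, svm on an AMD-V system, and nothing at all if neither flag is present.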

2)  Ensure your motherboard’s firmware supports AMD-Vi or VT-d, as appropriate (just check the vendor site or the documentation with the hardware).

3)  Ensure the option to enable IOMMU support is selected in the motherboard BIOS or EFI.

  • Most of the time, this option is in the North Bridge configuration or thereabouts, but check the manual for your particular motherboard for the location of this option which should include the term “IOMMU”

4)  Boot into Fedora 21 and execute the following command to determine if the system recognizes this capability:

#Good output looks like:

$ dmesg | grep IOMMU
[    0.100209] dmar: IOMMU 0: reg_base_addr fed90000 ver 1:0 cap c0000020e60262 ecap f0101a
[    0.100214] dmar: IOMMU 1: reg_base_addr fed91000 ver 1:0 cap c9008020660262 ecap f0105a
[    0.100285] IOAPIC id 2 under DRHD base  0xfed91000 IOMMU 1

#or

$ dmesg | grep -e IOMMU -e AMD-Vi
[    0.100209] AMD-Vi: Enabling IOMMU at 0000:00:00.2 cap 0x40 
[    0.100214] AMD-Vi: Lazy IO/TLB flushing enabled 
[    0.100285] AMD-Vi: Initialized for Passthrough Mode
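Kernel versions differ in how they capitalize these messages ("dmar:" in the output above, "DMAR:" on later kernels), so a case-insensitive filter can be more forgiving.  This is only a sketch, and iommu_log_lines is a made-up name:

```shell
# Filter kernel log text on stdin for IOMMU-related lines.
# iommu_log_lines is a hypothetical helper name.
iommu_log_lines() {
    # -i ignores case so "dmar:", "DMAR:", "IOMMU" and "AMD-Vi" all match.
    grep -iE 'dmar|iommu|amd-vi'
}
```

Use it as `dmesg | iommu_log_lines`; no output at all suggests the firmware option from step 3 is still disabled.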

That’s it; if you’ve successfully completed the steps above, you have met the hardware prerequisites for this operation.  The hardware I’m using for this purpose (and whose functionality I can verify) is as follows:

  • Motherboard:  ASRock Z77 Extreme4
  • Processor:  Intel Core i5-3470 LGA-1155

The motherboard is well-equipped with USB 3.0 and 2.0 jacks, SATA II and SATA III ports, and PCI expansion card slots, which makes it a solid, versatile platform for virtualization.  With the PCI expansion card space, one can add NICs and fast IO devices, or even GPUs, and hand them directly to guest domains (virtual machines).  As you will see below, choosing the correct locations for these devices requires a survey of the motherboard’s PCI architecture.

Preparing the Fedora 21 Operating System as a Virtualization Platform

1)  Enable IOMMU support in the Linux Kernel

  • To test everything out without permanently modifying GRUB to boot with this support included, simply boot your machine to the GRUB menu, press e to edit the selected entry, append “intel_iommu=on” to the end of the kernel arguments (the line beginning with “linux” or “linuxefi”), and boot.  If the system comes up normally, you should be in the clear.
    • If your machine does come up normally, you can execute `cat /proc/cmdline` to verify that the argument you provided to the kernel (intel_iommu=on) was recognized.
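Eyeballing /proc/cmdline works, but the check is easy to script.  A minimal sketch follows; has_kernel_arg is a hypothetical name, and the second argument is there only so the function can be tried on a saved file:

```shell
# Succeed if the given argument appears as a complete word in a kernel
# command line file (default: /proc/cmdline).
# has_kernel_arg is a hypothetical helper name.
has_kernel_arg() {
    arg="$1"
    cmdline="${2:-/proc/cmdline}"
    # Put each space-separated argument on its own line, then require an
    # exact fixed-string match of a whole line.
    tr ' ' '\n' < "$cmdline" | grep -qxF -- "$arg"
}
```

For example:

$ has_kernel_arg intel_iommu=on && echo "IOMMU argument present"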

When you’ve tested your system and all is well, append the intel_iommu=on argument to the end of the GRUB_CMDLINE_LINUX string as done in the following example:

 # vim /etc/default/grub
    GRUB_CMDLINE_LINUX="rd.lvm.lv=fedora-server/root rd.lvm.lv=fedora-server/swap rhgb quiet intel_iommu=on"
    :wq
 #On UEFI systems (on BIOS systems, write to /boot/grub2/grub.cfg instead):
 # grub2-mkconfig -o /boot/efi/EFI/fedora/grub.cfg
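The output path passed to grub2-mkconfig depends on how the machine booted.  As a rough sketch, assuming the Fedora 21 layout, the presence of /sys/firmware/efi can pick the path; fedora_grub_cfg is a made-up name, and its optional argument is an alternate filesystem root used only for experimentation:

```shell
# Print the grub.cfg path grub2-mkconfig should write to on Fedora 21,
# based on whether the kernel exposes EFI firmware data.
# fedora_grub_cfg is a hypothetical helper name.
fedora_grub_cfg() {
    root="${1:-}"
    if [ -d "$root/sys/firmware/efi" ]; then
        echo /boot/efi/EFI/fedora/grub.cfg   # booted via UEFI
    else
        echo /boot/grub2/grub.cfg            # booted via legacy BIOS
    fi
}
```

With that defined, the regeneration step becomes:

 # grub2-mkconfig -o "$(fedora_grub_cfg)"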

2)  Reboot.

3) Install the Virtualization Platform group:

 # yum groupinstall @virtualization

4)  Reboot (may not be necessary; I’d do it anyway).
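After the reboot, it is worth confirming that the KVM kernel modules actually loaded.  The sketch below reads /proc/modules; loaded_kvm_modules is a hypothetical name, and the file argument is only there so the function can be run against a saved copy:

```shell
# List loaded KVM kernel modules from a /proc/modules-style file.
# loaded_kvm_modules is a hypothetical helper name.
loaded_kvm_modules() {
    modules="${1:-/proc/modules}"
    # The module name is the first whitespace-separated field; match
    # kvm itself plus the vendor-specific kvm_intel and kvm_amd.
    awk '$1 ~ /^kvm(_intel|_amd)?$/ { print $1 }' "$modules" | sort
}
```

On an Intel machine, I would expect this to print kvm and kvm_intel; empty output usually means virtualization is disabled in the firmware.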

Surveying Your Motherboard’s PCI Device Architecture

So, the motherboard I’m using for this demonstration is listed above.  Your results will vary depending on the way in which your motherboard is designed, but the principles laid out here should give you the understanding and tools you need to figure out the capabilities of your hardware and design on that basis.

VT-d and KVM support IOMMU remapping based on IOMMU groups, which are collections of PCI devices to which control may be passed.  You can’t always simply pass control to a particular hard disk; often the hard disk connects to a port belonging to a single controller that also provides a number of the other ports physically nearby.  The entire controller must be passed to the guest, so some planning is required to ensure you don’t assign, say, a four-port SATA controller to a guest which needs only two hard disks if it can be avoided.

The first thing to do is check out the layout of the system in virsh:

$ sudo virsh
virsh # nodedev-list --tree
computer
  |
  +- net_lo_00_00_00_00_00_00
  +- pci_0000_00_00_0
  +- pci_0000_00_02_0
  +- pci_0000_00_14_0
  |   |
  |   +- usb_usb1
  |   |   |
  |   |   +- usb_1_0_1_0
  |   |   +- usb_1_1
  |   |   |   |
  |   |   |   +- usb_1_1_1_0
  |   |   |       |
  |   |   |       +- scsi_host8
  |   |   |           |
  |   |   |           +- scsi_target8_0_0
  |   |   |               |
  |   |   |               +- scsi_8_0_0_0
  |   |   |                   |
  |   |   |                   +- block_sdf_SanDisk_Cruzer_Fit_4C530009730716110301_0_0
  |   |   |                   +- scsi_generic_sg5
  |   |   |                     
  |   |   +- usb_1_2
  |   |       |
  |   |       +- usb_1_2_1_0
  |   |         
  |   +- usb_usb2
  |       |
  |       +- usb_2_0_1_0
  |         
  +- pci_0000_00_16_0
...

The capture above is only a small fraction of the entire device map of my motherboard.  In it, we can see a general form which will be repeated and modified throughout the tree.  The device nodes whose labels begin with “pci_” are going to be the devices we can consider passing through to our guests, though even the top-level pci_ devices may be in the same IOMMU group (and therefore require that they be jointly added to any given guest domain).

The most expedient way to go about this is to identify the PCI devices hosting the hardware you are interested in passing directly to guest domains and check on their groups by referencing their corresponding virtual file system directories in /sys/bus/pci/devices/.  For example, if I were interested in the device labeled “pci_0000_00_1f_0”, I would query as follows:

$ ll /sys/bus/pci/devices/0000\:00\:1f.0/iommu_group/devices/
     lrwxrwxrwx. 1 root root 0 May  3 10:29 0000:00:1f.0 -> ../../../../devices/pci0000:00/0000:00:1f.0
     lrwxrwxrwx. 1 root root 0 May  3 10:29 0000:00:1f.2 -> ../../../../devices/pci0000:00/0000:00:1f.2
     lrwxrwxrwx. 1 root root 0 May  3 10:29 0000:00:1f.3 -> ../../../../devices/pci0000:00/0000:00:1f.3

This lists the PCI devices sharing an IOMMU Group with the queried device (the queried device included).  So, what we see above is that pci_0000_00_1f_0 must be passed to a guest domain along with devices pci_0000_00_1f_2 and pci_0000_00_1f_3.

If we look at the portion of my device tree which represents these objects, we find that all three devices appear as top-level nodes, so there doesn’t appear to be any way to discern IOMMU Group membership from the virsh nodedev-list output.  Unfortunately, the best I know to do is investigate each PCI device of interest individually to gain an understanding of the group topology and plan accordingly.
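One way to shorten the device-by-device investigation is to walk /sys/kernel/iommu_groups, where the kernel lists every group once the IOMMU is enabled.  This is a minimal sketch rather than a polished tool: list_iommu_groups is a made-up name, and the optional argument overrides the sysfs directory for experimentation.

```shell
# Print every IOMMU group and the PCI addresses it contains.
# list_iommu_groups is a hypothetical helper name; the directory
# defaults to /sys/kernel/iommu_groups.
list_iommu_groups() {
    base="${1:-/sys/kernel/iommu_groups}"
    for dev in "$base"/*/devices/*; do
        [ -e "$dev" ] || continue      # the glob matched nothing
        group="${dev%/devices/*}"      # strip "/devices/<address>"
        group="${group##*/}"           # keep only the group number
        echo "group $group: ${dev##*/}"
    done
}
```

Each output line has the form "group N: 0000:00:1f.0", so devices that must travel together share a group number.  Groups come out in glob order, which means group 10 sorts before group 2.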

Detaching PCI Devices in Preparation for Guest Domain Control Transfer

Assuming we had identified the IOMMU Group containing these three PCI devices (above) as the group whose control we would like to transfer to the guest domain, we must now detach these devices from the virtualization platform operating system’s kernel (which is acting as the hypervisor for the guest domains), leaving them available to be controlled by the target guest domain.

#If you're already in virsh, omit the first step:
$ sudo virsh 
virsh # nodedev-dettach pci_0000_00_1f_0
Device pci_0000_00_1f_0 detached
virsh # nodedev-dettach pci_0000_00_1f_2
Device pci_0000_00_1f_2 detached
virsh # nodedev-dettach pci_0000_00_1f_3
Device pci_0000_00_1f_3 detached
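To double-check the detach, you can look at which kernel driver each device is now bound to; after a successful detach I would expect a stub driver such as vfio-pci or pci-stub rather than the device’s usual driver (which one depends on your libvirt version and configuration).  The following is a sketch with a made-up name, pci_bound_driver, and an optional second argument that overrides the sysfs directory:

```shell
# Print the kernel driver currently bound to a PCI device, or "none".
# pci_bound_driver is a hypothetical helper name; the device directory
# defaults to /sys/bus/pci/devices.
pci_bound_driver() {
    addr="$1"
    base="${2:-/sys/bus/pci/devices}"
    link="$base/$addr/driver"
    if [ -L "$link" ]; then
        # The symlink target ends in the driver name.
        basename "$(readlink "$link")"
    else
        echo none
    fi
}
```

For example:

$ pci_bound_driver 0000:00:1f.2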

Using virsh and the Virtual Machine Manager to Grant Device Control to Guest Domains

Brief Apology:  I like a command-line-centric approach to problems as much as anyone, but the Virtual Machine Manager is an excellent piece of software, and its GUI interface for the next steps is just a lot easier than the virsh path.  I may write instructions for the latter at a later date, but if you’re managing a hypervisor arrangement of this complexity, I imagine you have VMM installed on a remote workstation anyway, or you don’t need me to tell you how to do this in virsh.

Once the device groups are identified and the guest domains are ready to have the devices added (all other hardware choices have been made), open the Virtual Machine Manager and:

  1. Select the relevant guest domain
  2. Select Open and then Show virtual hardware details in the upper right.
  3. Choose Add Hardware in the bottom left.
  4. Select PCI Host Device from the list of options on the left.
  5. Locate the PCI devices in the IOMMU Group whose control you wish to transfer to the guest domain.  They are listed by the addresses recorded from the nodedev-list output above, with the “pci_” prefix dropped, the underscores converted to colons, and the final underscore converted to a period, as in the sys virtual file system path above.  Add them individually to the guest domain until all have been added.
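The name conversion is mechanical, so it can be scripted.  A small sketch, with nodedev_to_pci_addr as a hypothetical name:

```shell
# Convert a virsh node-device name (pci_0000_00_1f_0) into the PCI
# address form used by sysfs and lspci (0000:00:1f.0).
# nodedev_to_pci_addr is a hypothetical helper name.
nodedev_to_pci_addr() {
    echo "$1" | sed -E 's/^pci_([0-9a-f]{4})_([0-9a-f]{2})_([0-9a-f]{2})_([0-7])$/\1:\2:\3.\4/'
}
```

Input that does not look like a PCI node-device name is printed back unchanged.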

So, continuing the example using the PCI device IDs provided above, I would search the list for three PCI devices whose addresses read as 0000:00:1f.0, 0000:00:1f.2, and 0000:00:1f.3.
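For readers who would rather stay in virsh, the same result can be approximated by generating a libvirt hostdev element for each device and attaching it to the guest.  Treat this as a sketch under assumptions, not a tested procedure from this walkthrough: hostdev_xml is a made-up name, and managed='yes' is one reasonable choice (it asks libvirt to handle detaching and reattaching the device itself).

```shell
# Emit a libvirt <hostdev> element for a virsh node-device name such
# as pci_0000_00_1f_0.  hostdev_xml is a hypothetical helper name.
hostdev_xml() {
    # Split "pci_DDDD_BB_SS_F" on underscores into positional parameters.
    old_ifs=$IFS
    IFS=_
    set -- $1
    IFS=$old_ifs
    # Now: $1=pci  $2=domain  $3=bus  $4=slot  $5=function
    cat <<EOF
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x$2' bus='0x$3' slot='0x$4' function='0x$5'/>
  </source>
</hostdev>
EOF
}
```

Usage might look like the following, where “myguest” is a placeholder for your guest domain’s name; repeat for every device in the IOMMU Group:

$ hostdev_xml pci_0000_00_1f_0 > hostdev.xml
$ sudo virsh attach-device myguest hostdev.xml --config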

Start the Guest Domain

And enjoy!  If all has gone well, you should have no problems.  If something has gone wrong, your guest domain will likely suffer a kernel panic, so the problem will be apparent immediately.

Welcome to the cutting edge of modern system engineering!  It is flat-out amazing what we can do with commodity hardware.
