Well, it took eight hours of cobbling together instructions from a variety of guides on the Interwebs, but it’s done! I have added a GeForce GTX 960 to my server (running Fedora 22 Server) and I have given control of the GPU to a KVM/QEMU virtual machine. I can now play video games using the same hardware that simultaneously operates my FreeBSD 10.2 file server (another guest domain) and some other small utility servers (other guest domains).
So, in the interest of furthering and promoting the use of this great software, behold: the instructions!
I’m using the following hardware components:
- Motherboard: ASRock Z77 Extreme4 LGA1155
- Processor: Intel Core i5-3470
- RAM: 24 GB DDR3-1600
The Plan for a Gaming Guest Domain
The plan is to grant a particular guest domain access to the GPU (and therefore the monitor connected to it), the onboard sound (and therefore the speakers connected to it), a mouse, and a keyboard. This way, the guest domain can be used in exactly the same manner as any computer to which these human interface devices are connected. For the mouse and keyboard, we can make use of KVM/QEMU’s simple USB device passthrough. For the GPU and sound, we’ll need to make use of PCI device passthrough.
We’re going to:
- Ensure hardware support for IOMMU passthrough
- Conduct a survey of our PCI devices
- Isolate the PCI devices to be passed to our guest domain
- Configure and install the guest domain
Exciting! Let’s get to it.
Ensuring Hardware Support for IOMMU Passthrough
In order to make use of this functionality, one must ensure that one’s hardware supports VT-d (if Intel) or AMD-Vi (if AMD). These processor extensions are required for IOMMU passthrough. If they’re not available, you’re out of luck.
AMD processors advertise their virtualization extensions with the “svm” flag (as with Intel’s vmx flag below, the CPU flag alone doesn’t guarantee IOMMU support, so confirm AMD-Vi in the vendor’s documentation as well):
```
$ cat /proc/cpuinfo | grep svm
# If no output is provided, the Linux kernel does not
# believe your processor is supported
```
Intel processors provide the vmx flag, but this only indicates VT-x support, which is insufficient on its own (we need VT-d). The best way to determine support is to check the vendor’s product page and confirm that the processor supports VT-d.
Additionally, your motherboard must support VT-d or AMD-Vi, as appropriate. Again, check the vendor site or the documentation that came with the hardware.
Finally, ensure the option to enable IOMMU support is selected in the motherboard BIOS or EFI. Most of the time, this option is in the North Bridge configuration or thereabouts, but check your motherboard’s manual for the location of this option, which should include the term “IOMMU”.
Preparing the Linux Kernel for IOMMU Support
Now, you need to modify your GRUB configuration so that the Linux kernel boots with recognition and support for IOMMU. To do this, simply add the argument “intel_iommu=on” or “amd_iommu=on” to the kernel arguments in GRUB. On Fedora 22, this involves modifying the /etc/sysconfig/grub file and then running grub2-mkconfig. For my configuration, this means:
```
$ sudo vim /etc/sysconfig/grub
GRUB_CMDLINE_LINUX="rd.lvm.lv=fedora-server/root rd.lvm.lv=fedora-server/swap rhgb quiet intel_iommu=on"
:wq
$ sudo grub2-mkconfig -o /boot/efi/EFI/fedora/grub.cfg
```
More details are available at this great blog post from Alex Williamson. Note that your grub2-mkconfig output path will be different if you’re booting via legacy BIOS rather than UEFI.
At this point, you may reboot your system and confirm that IOMMU is recognized properly by consulting the kernel ring buffer through dmesg:
```
$ dmesg | grep -i IOMMU
[ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-4.2.6-201.fc22.x86_64 root=/dev/mapper/fedora--server-root ro rd.lvm.lv=fedora-server/root rd.lvm.lv=fedora-server/swap rhgb quiet intel_iommu=on
[ 0.000000] Kernel command line: BOOT_IMAGE=/vmlinuz-4.2.6-201.fc22.x86_64 root=/dev/mapper/fedora--server-root ro rd.lvm.lv=fedora-server/root rd.lvm.lv=fedora-server/swap rhgb quiet intel_iommu=on
[ 0.000000] DMAR: IOMMU enabled
[ 0.102255] DMAR-IR: IOAPIC id 2 under DRHD base 0xfed91000 IOMMU 1
[ 0.439091] iommu: Adding device 0000:00:00.0 to group 0
...
[ 0.439308] iommu: Adding device 0000:0a:00.0 to group 22
```
Conducting A Survey of PCI Devices and their IOMMU Groups
In order to grant a particular guest domain direct control over PCI devices that would otherwise be controlled by the host OS, the first thing to do is identify those devices and their IOMMU groups. Without manually patching the Linux kernel (which I do not recommend), only entire IOMMU groups can be passed to guest domains. Sometimes an IOMMU group contains only a single PCI device, which makes passthrough simple and convenient. Other times, PCI devices are lumped together inextricably into a single IOMMU group, and the whole set must be passed to the target guest domain. On my motherboard, this is manageable, and I have no need to entertain patching the kernel to separate devices from their IOMMU group peers.
The first thing to do is hit up good ol’ reliable lspci and check out the array of devices available on your system:
```
$ lspci
00:00.0 Host bridge: Intel Corporation Xeon E3-1200 v2/3rd Gen Core processor DRAM Controller (rev 09)
00:01.0 PCI bridge: Intel Corporation Xeon E3-1200 v2/3rd Gen Core processor PCI Express Root Port (rev 09)
00:02.0 VGA compatible controller: Intel Corporation Xeon E3-1200 v2/3rd Gen Core processor Graphics Controller (rev 09)
...
00:1b.0 Audio device: Intel Corporation 7 Series/C210 Series Chipset Family High Definition Audio Controller (rev 04)
...
00:1f.2 SATA controller: Intel Corporation 7 Series/C210 Series Chipset Family 6-port SATA Controller [AHCI mode] (rev 04)
00:1f.3 SMBus: Intel Corporation 7 Series/C210 Series Chipset Family SMBus Controller (rev 04)
01:00.0 VGA compatible controller: NVIDIA Corporation GM206 [GeForce GTX 960] (rev a1)
01:00.1 Audio device: NVIDIA Corporation Device 0fba (rev a1)
02:00.0 PCI bridge: PLX Technology, Inc. Device 8603 (rev ab)
...
0a:00.0 USB controller: ASMedia Technology Inc. ASM1042 SuperSpeed USB Host Controller
```
In that listing (truncated where indicated by ellipses), each line shows the device address followed by a description of the device. The devices of primary interest for the guest domain, which will be used to play games and perform as a general desktop, are:
- The GPU (01:00.0)
- The GPU’s integrated sound facility for HDMI support (01:00.1)
- The motherboard’s integrated sound apparatus (00:1b.0)
In order to identify the IOMMU group layout of the system, we can turn to the sys virtual file system:
```
$ find /sys/kernel/iommu_groups/ -type l
/sys/kernel/iommu_groups/0/devices/0000:00:00.0
/sys/kernel/iommu_groups/1/devices/0000:00:01.0
/sys/kernel/iommu_groups/1/devices/0000:01:00.0
/sys/kernel/iommu_groups/1/devices/0000:01:00.1
/sys/kernel/iommu_groups/2/devices/0000:00:02.0
...
/sys/kernel/iommu_groups/6/devices/0000:00:1b.0
...
/sys/kernel/iommu_groups/13/devices/0000:00:1f.0
/sys/kernel/iommu_groups/13/devices/0000:00:1f.2
/sys/kernel/iommu_groups/13/devices/0000:00:1f.3
/sys/kernel/iommu_groups/14/devices/0000:02:00.0
...
/sys/kernel/iommu_groups/22/devices/0000:0a:00.0
```
Again, the output has been truncated for convenience, but you should still be able to observe correspondence between the output above and the previous lspci output.
This command finds all of the symbolic links in the /sys/kernel/iommu_groups directory and displays them. This is a convenient way to observe the IOMMU group constituents; each group is specified by the subdirectory immediately following the iommu_groups directory and its members are iterated plainly. For example, group 1 contains devices 0000:00:01.0, 0000:01:00.0, and 0000:01:00.1, whereas group 2 contains only device 0000:00:02.0.
As you can see, groupings are convenient for our purposes here; both GPU PCI addresses are held by IOMMU group 1 (along with the PCI Express root port 0000:00:01.0) and the sound device stands alone in IOMMU group 6.
Another way to get some information about the arrangement of your PCI devices is to use virsh nodedev-list --tree. Though we don’t need any more information for our purposes at this point, I recommend giving it a shot; it’s very illustrative of the PCI device architecture.
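If you’d like the group structure summarized more compactly than the raw find output, a small bash function can do it. This is just a sketch assuming the standard sysfs layout shown above; the root directory is a parameter only so the function is easy to test elsewhere.

```shell
# List each IOMMU group and its member devices, e.g.
#   group 1: 0000:00:01.0 0000:01:00.0 0000:01:00.1
# Assumes the /sys/kernel/iommu_groups layout shown above.
list_iommu_groups() {
  local root="${1:-/sys/kernel/iommu_groups}"
  local group dev
  for group in "$root"/*/; do
    [ -d "$group" ] || continue
    printf 'group %s:' "$(basename "$group")"
    for dev in "$group"devices/*; do
      [ -e "$dev" ] || continue
      printf ' %s' "$(basename "$dev")"
    done
    printf '\n'
  done
}

list_iommu_groups   # defaults to /sys/kernel/iommu_groups
```

Note that groups print in lexical rather than numeric order (1, 10, 11, 2, ...), which is fine for a quick survey.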
Isolating the PCI Devices for Guest Domains
Currently, the host OS (Fedora 22 Server) controls these PCI devices. I can inspect any given device and identify the driver currently in use by the host OS Linux kernel to see exactly how the device is being managed:
```
$ lspci -vv -s 00:1b.0
00:1b.0 Audio device: Intel Corporation 7 Series/C210 Series Chipset Family High Definition Audio Controller (rev 04)
	Subsystem: ASRock Incorporation Z77 Extreme4 motherboard
	Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Interrupt: pin A routed to IRQ 3
	Region 0: Memory at f7c10000 (64-bit, non-prefetchable) [disabled] [size=16K]
	Capabilities: <access denied>
	Kernel driver in use: snd_hda_intel
	Kernel modules: snd_hda_intel
```
As you can see, the snd_hda_intel driver is being used to manage the audio device. If I were to query the GPU, I’d find that the nouveau driver is being used to manage it (as is the default when using Fedora products and NVidia GPUs).
In order to allow the guest domains to take control of the hardware, I need to prevent the host OS from taking control of that hardware. There are two methods by which to accomplish this sort of hardware isolation: pci-stub and vfio-pci. Both are discussed by this great blog series.
I recommend using pci-stub at this point because it is built into the Linux kernel by default on Fedora 22 Server. I attempted to follow the instructions on that site for preloading the vfio-pci driver, but my machine refused to boot when I included the kernel argument necessary for the preloading. Perhaps I generated my initramfs incorrectly, but in any case, I don’t see any advantage to using the vfio-pci driver here, and the pci-stub driver works perfectly well.
So, to isolate the hardware components we need to use, we need to determine the PCI device IDs (which are separate from the device addresses listed in the first field of the lspci output above). To do this, we specify each device by address and ask for the ID:
```
$ lspci -n -s 00:1b.0
00:1b.0 0403: 8086:1e20 (rev 04)
```
The third space-delimited field in the output (here, 8086:1e20) is the device ID we’re after. We will identify these devices by their IDs to the pci-stub driver so that it claims them during the system’s boot, making them eligible for PCI passthrough to guest domains (which would not be the case if the host OS were controlling them through other drivers not intended to support this functionality).
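As a sanity check on the format we’re about to feed to GRUB, here is how the IDs assemble into a pci-stub.ids= argument: take the third field of each lspci -n line and join them with commas. The lines below mirror the devices identified above (the class codes and revisions for the two NVidia functions are illustrative, not copied from my system):

```shell
# Extract the vendor:device ID (third field) from each "lspci -n" style
# line and join them into the comma-delimited list pci-stub expects.
ids=$(printf '%s\n' \
  '00:1b.0 0403: 8086:1e20 (rev 04)' \
  '01:00.0 0300: 10de:1401 (rev a1)' \
  '01:00.1 0403: 10de:0fba (rev a1)' \
  | awk '{print $3}' | paste -sd, -)
echo "pci-stub.ids=$ids"
# prints: pci-stub.ids=8086:1e20,10de:1401,10de:0fba
```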
So now, we need to modify our GRUB line again in /etc/sysconfig/grub. This time, we’re adding the pci-stub directive to claim these devices by their IDs:
```
$ sudo vim /etc/sysconfig/grub
GRUB_CMDLINE_LINUX="rd.lvm.lv=fedora-server/root rd.lvm.lv=fedora-server/swap rhgb quiet intel_iommu=on pci-stub.ids=8086:0152,10de:1401,8086:1e20"
:wq
$ sudo grub2-mkconfig -o /boot/efi/EFI/fedora/grub.cfg
```
As you can see, you simply enter the argument pci-stub.ids= and follow that with a comma-delimited series of device IDs. Remember to make sure you claim all devices which are members of the IOMMU groups you are passing through to your guest domain. If you attempt to pass through only part of an IOMMU group, your system will probably crash. Obviously, this implies that you should ensure you’re not passing through any devices necessary to the operation of the host, as well (e.g. don’t try to use pci-stub to claim the PCI controller connected to your boot disk, or your system will fail to boot!).
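One way to double-check that rule is a small bash helper that, given a device address, prints every device sharing its IOMMU group. Again, this is a sketch against the sysfs layout surveyed above, with the root parameterized for testability:

```shell
# Print all devices in the same IOMMU group as the given address; every
# one of these must be accounted for before attempting passthrough.
iommu_group_peers() {
  local addr="$1" root="${2:-/sys/kernel/iommu_groups}"
  local group
  for group in "$root"/*/; do
    if [ -e "${group}devices/${addr}" ]; then
      ls "${group}devices"
      return 0
    fi
  done
  echo "no IOMMU group found for ${addr}" >&2
  return 1
}

# e.g. iommu_group_peers 0000:01:00.0
# would list 0000:00:01.0, 0000:01:00.0, and 0000:01:00.1 on my system
```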
With this done, you may try rebooting your system and observing again in the kernel ring buffer the success of the pci-stub driver in claiming the hardware identified to it:
```
$ dmesg | grep pci-stub
[ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-4.2.6-201.fc22.x86_64 root=/dev/mapper/fedora--server-root ro rd.lvm.lv=fedora-server/root rd.lvm.lv=fedora-server/swap rhgb quiet intel_iommu=on pci-stub.ids=8086:0152,10de:1401,8086:1e20,8086:0100
[ 0.000000] Kernel command line: BOOT_IMAGE=/vmlinuz-4.2.6-201.fc22.x86_64 root=/dev/mapper/fedora--server-root ro rd.lvm.lv=fedora-server/root rd.lvm.lv=fedora-server/swap rhgb quiet intel_iommu=on pci-stub.ids=8086:0152,10de:1401,8086:1e20,8086:0100
[ 0.447971] pci-stub: add 8086:0100 sub=FFFFFFFF:FFFFFFFF cls=00000000/00000000
[ 0.447973] pci-stub 0000:00:01.0: claimed by stub
[ 0.447974] pci-stub: add 8086:0152 sub=FFFFFFFF:FFFFFFFF cls=00000000/00000000
[ 0.447979] pci-stub 0000:00:02.0: claimed by stub
[ 0.447982] pci-stub: add 10DE:1401 sub=FFFFFFFF:FFFFFFFF cls=00000000/00000000
[ 0.447986] pci-stub 0000:01:00.0: claimed by stub
[ 0.447988] pci-stub: add 8086:1E20 sub=FFFFFFFF:FFFFFFFF cls=00000000/00000000
[ 0.447991] pci-stub 0000:00:1b.0: claimed by stub
```
Looks good! All the devices we identified were successfully claimed. Now, if you check out the lspci output for one of the devices, you’ll see that the pci-stub driver is in use for the device:
```
$ lspci -vv -s 00:1b.0
00:1b.0 Audio device: Intel Corporation 7 Series/C210 Series Chipset Family High Definition Audio Controller (rev 04)
	Subsystem: ASRock Incorporation Z77 Extreme4 motherboard
	Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Interrupt: pin A routed to IRQ 3
	Region 0: Memory at f7c10000 (64-bit, non-prefetchable) [disabled] [size=16K]
	Capabilities: <access denied>
	Kernel driver in use: pci-stub
	Kernel modules: snd_hda_intel
```
Excellent. As long as all of your devices check out, we’ve done the hard work. Now we have only to use virt-manager to build a guest domain, trim down the stock virtualized hardware, and then engage the PCI passthrough mechanism.
Configuring and Installing the Guest Domain
Again, I found Alex Williamson’s blog to be most helpful. You can follow his instructions more or less directly, but basically we’re going to:
- Install the UEFI firmware image to be used by the guest domain.
- This is essential to getting the guest domain to make use of the GPU, in my experience.
- Perform a relatively standard installation of a guest domain
- Trim the unnecessary hardware from the guest domain
- Add the PCI devices to the guest domain
- Reboot the guest domain and install the drivers appropriate to the graphics card
Installing the UEFI firmware image
So, in order for this to work, you need to emulate a UEFI environment for the guest domain rather than implement the stock BIOS emulation. To accomplish this, I make use of OVMF (Open Virtual Machine Firmware). These images can be obtained by following these simple instructions. You simply add Gerd Hoffmann’s repository and install the firmware through DNF. Thanks, Gerd! You are the man.
Installing the Guest Domain
Now that you’ve got the UEFI firmware on your system, you should be able to follow the instructions at Alex Williamson’s blog to create your VM. Go through the standard VM build in virt-manager and ensure that you select “Customize Configuration Before Install” at the end of the wizard. This allows you to select UEFI rather than BIOS. You can follow Alex’s remaining VM tuning advice if you like. I don’t personally bother with hugepages or CPU pinning (I’m letting my gaming rig have a share of all processor cores on the system), but they aren’t bad ideas.
Once you’ve performed this initial configuration, boot the guest domain and install the OS of your choice (I’m using Fedora 23 Workstation, but I may develop a Windows 10 guest domain as well). When you’re done, reboot the system, fully update it, and ensure that you give yourself a means by which to access it remotely if your graphics card endeavor fails on your first attempt (that means enabling sshd, at the very least, in Fedora 23 Workstation).
Once you’ve got the system installed and updated, shut down the guest domain and remove all extraneous hardware as Alex demonstrates. Add the USB keyboard and USB mouse connected to your host machine which you intend to use for the guest domain, using the simple USB passthrough option. Then add the PCI devices representing your GPU (and sound, if you’re passing that through as well), and ensure your monitor is connected to your GPU as you would normally connect it for any other desktop.
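For reference, the PCI passthrough entries that virt-manager creates correspond to hostdev elements in the guest domain’s XML (visible via virsh edit). A sketch of what the GPU entry would look like for device 01:00.0 from the survey above, assuming libvirt’s standard schema:

```xml
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <!-- host address of the GTX 960 identified earlier via lspci -->
    <address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
  </source>
</hostdev>
```

The GPU’s audio function (01:00.1) and the onboard audio (00:1b.0) get analogous entries with their own addresses.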
Now, if you’re using an NVidia GPU, it is imperative that you also modify the guest domain’s XML configuration file (sudo virsh edit guestname) and follow Alex’s instructions in adding the following tags within the <features> section:
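The tags in question appear to have been lost in this post’s formatting; per Alex’s instructions, they are the feature flags that hide the KVM hypervisor signature from NVidia’s driver, which otherwise refuses to initialize inside a VM. Inside the <features> section:

```xml
<kvm>
  <hidden state='on'/>
</kvm>
```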
If you don’t do that, your guest domain will not boot and you will be most depressed.
Once all that’s done, power the system on! As you anxiously observe the monitor for signs of life, you should observe the UEFI Tianocore splash screen, followed by GRUB. The system should come up as any other system would.
If your entire KVM/QEMU platform crashes when you shut down your guest domain, throwing stack trace output containing

Code: Bad RIP value

double-check your pci-stub device assignments. In my case, the problem was that I had not attached the pci-stub driver to the audio facility on the GPU. In fact, if you were paying close attention to my output above, you probably noticed that I accidentally pci-stubbed device 00:02.0 (my Intel processor’s integrated graphics controller) rather than 01:00.1 (my NVidia card’s audio facility).
Once I pci-stubbed the proper device and rebooted, I could reboot my guest domain at whim without worry of a crash.