Most modern AMD GPUs suffer from the AMD Reset Bug: the card cannot be reset properly, so it can only be used once per host power-on. The second time a VM tries to use the card, Linux attempts to reset it and fails, causing the VM launch to fail, or the guest, the host, or both to hang.
gnif’s new vendor-reset project is an attempt to work around this AMD reset issue by replacing AMD’s missing FLR support with vendor-specific reset quirks.
The current lineup of supported GPUs covers various Polaris, Vega and Navi models, including GPUs in the same series as the RX 480, RX 540, RX 580, Vega 56/64, Radeon VII, 5500XT, 5700XT and Pro 5600M (see the repo for the full list of supported chipsets).
Installing vendor-reset on Proxmox
First, upgrade to the latest kernel (apt update && apt dist-upgrade) and reboot into it. Otherwise the kernel headers fetched by pve-headers won’t match the currently running kernel, and dkms will fail to build the package.
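One way to catch this mismatch before building is to look for the headers of the running kernel. The following is a minimal sketch, assuming the headers package exposes a build entry under /lib/modules/<release>/ (the place dkms normally looks). have_headers is a hypothetical helper name, not part of Proxmox or dkms.

```shell
#!/bin/sh
# Sketch: confirm that build headers exist for a given kernel release.
# have_headers is a hypothetical helper. The optional second argument lets
# the check be pointed at a different modules root.
have_headers() {
    release="$1"
    modules_root="${2:-/lib/modules}"
    # Assumption: installed headers show up as a "build" entry here.
    [ -e "$modules_root/$release/build" ]
}

if have_headers "$(uname -r)"; then
    echo "headers found for $(uname -r)"
else
    echo "no headers for $(uname -r); upgrade and reboot before building"
fi
```

If the second message appears, the running kernel is probably older than the installed headers, which is the situation described above.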
Now you can install vendor-reset like so:
# Get latest Proxmox kernel headers:
apt install pve-headers
# Did that fail? If so make sure you have Proxmox repository set up properly!
# https://pve.proxmox.com/wiki/Package_Repositories

# Get required build tools:
apt install git dkms build-essential

# Perform the build:
git clone https://github.com/gnif/vendor-reset.git
cd vendor-reset
dkms install .

# Enable vendor-reset to be loaded automatically on startup:
echo "vendor-reset" >> /etc/modules
update-initramfs -u

# Reboot to load the module:
shutdown -r now
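After the reboot, it can be worth confirming that the module actually loaded. This is a minimal sketch, assuming the kernel lists the module with an underscore (vendor_reset) in /proc/modules, as it does for module names that contain hyphens. module_loaded is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: check whether a kernel module appears in the loaded-module list.
# module_loaded is a hypothetical helper. The optional second argument is the
# list to read and defaults to /proc/modules.
module_loaded() {
    # The kernel reports hyphens in module names as underscores.
    name=$(printf '%s\n' "$1" | sed 's/-/_/g')
    list="${2:-/proc/modules}"
    [ -r "$list" ] || return 1
    # The first field of each line is the module name.
    awk -v m="$name" '$1 == m { found = 1 } END { exit !found }' "$list"
}

if module_loaded vendor-reset; then
    echo "vendor-reset is loaded"
else
    echo "vendor-reset is not loaded; check the output of 'dkms status' and /etc/modules"
fi
```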
Now when you start a VM that uses an AMD GPU, you’ll see messages like this appear in your dmesg output, showing that the new reset procedure is being used:
vfio-pci 0000:03:00.0: AMD_POLARIS10: version 1.0
vfio-pci 0000:03:00.0: AMD_POLARIS10: performing pre-reset
vfio-pci 0000:03:00.0: AMD_POLARIS10: performing reset
vfio-pci 0000:03:00.0: AMD_POLARIS10: GPU pci config reset
vfio-pci 0000:03:00.0: AMD_POLARIS10: performing post-reset
vfio-pci 0000:03:00.0: AMD_POLARIS10: reset result = 0
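To pick out resets that did not report success, the log can be filtered on the "reset result" line. This is a rough sketch, assuming a result of 0 means success (the usual kernel convention) and anything else means failure. failed_resets is a hypothetical helper, and the failing sample line is invented purely to demonstrate the filter.

```shell
#!/bin/sh
# Sketch: print only "reset result" lines whose final field is not 0.
# failed_resets is a hypothetical helper that reads log text on stdin.
failed_resets() {
    awk '/reset result = / && $NF != 0'
}

# Sample input in the format of the messages above. The second line is a
# made-up failure (hypothetical error code), not taken from a real log.
failed_resets <<'EOF'
vfio-pci 0000:03:00.0: AMD_POLARIS10: reset result = 0
vfio-pci 0000:03:00.0: AMD_POLARIS10: reset result = -22
EOF
```

Run as written, this prints only the second sample line. On the host, the same filter could be fed from the kernel log with dmesg | failed_resets.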
Unfortunately, with my RX 580, this module didn’t solve the reset issue for me, at least with macOS guests. However, on a bunch of newer AMD GPUs, vendor-reset is the answer to your prayers!
Before adding vendor-reset, I got errors like this reported from the PCIe root port the second time the card was initialized by a macOS guest:
pcieport 0000:00:02.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:02.0
pcieport 0000:00:02.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
pcieport 0000:00:02.0: AER: device [8086:3c04] error status/mask=00004000/00000000
pcieport 0000:00:02.0: AER:  CmpltTO (First)
pcieport 0000:00:02.0: AER: Device recovery successful
After that, the host’s kernel threads started reporting soft lockups until the whole host was brought down.
Now I no longer see those AER messages, but I still get “DMAR: DRHD: handling fault status reg 40”, followed by soft lockups that kill the host.
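When comparing behaviour before and after installing the module, it helps to pull only the relevant lines out of the kernel log. This is a small sketch that matches the three kinds of message mentioned here (AER, DMAR and soft lockup). host_fault_lines is a hypothetical helper name, and the sample input reuses lines quoted in this post.

```shell
#!/bin/sh
# Sketch: keep only kernel log lines that mention AER errors, DMAR faults,
# or soft lockups. host_fault_lines is a hypothetical helper reading stdin.
host_fault_lines() {
    grep -E 'AER:|DMAR:|soft lockup'
}

# Sample input built from messages quoted in this post.
host_fault_lines <<'EOF'
vfio-pci 0000:03:00.0: AMD_POLARIS10: reset result = 0
pcieport 0000:00:02.0: AER: Device recovery successful
DMAR: DRHD: handling fault status reg 40
EOF
```

On a live host this could be run as dmesg | host_fault_lines. After a hard hang the dmesg buffer is gone, so the previous boot's kernel log (for example journalctl -k -b -1, if the journal is persistent on your system) would be the place to look instead.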