Working around the AMD GPU Reset bug on Proxmox using vendor-reset

Most modern AMD GPUs suffer from the AMD reset bug: the card cannot be reset properly, so it can only be used once per host power-on. The second time a VM tries to use the card, Linux attempts to reset it and fails, causing the VM launch to fail, or the guest, the host, or both to hang.

gnif’s new vendor-reset project attempts to work around this AMD reset issue by substituting vendor-specific reset quirks for the Function Level Reset (FLR) support that these cards lack.

The current lineup of supported GPUs includes various Polaris, Vega and Navi models, including GPUs in the same series as the RX 480, RX 540, RX 580, Vega 56/64, Radeon VII, 5500XT, 5700XT, Pro 5600M (see the repo for the full list of supported chipsets).

Installing vendor-reset on Proxmox

First, update the kernel to the latest version and reboot (e.g. apt update && apt dist-upgrade, then reboot). Otherwise the kernel headers fetched by pve-headers won’t match the running kernel, and dkms will fail to build the module.
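A quick way to confirm the reboot actually landed you on the newest kernel is to compare `uname -r` with the kernel images in /boot. This is only a sketch: `newest_installed_kernel` is a hypothetical helper name, it assumes kernel images are named `vmlinuz-<release>` under /boot, and it relies on GNU `sort -V`. If you have pinned an older kernel on purpose, the mismatch it reports is expected.

```shell
# Hypothetical helper: print the newest kernel release found in a boot directory.
# Assumes kernel images are named vmlinuz-<release> (an assumption about your install).
newest_installed_kernel() {
    boot_dir="${1:-/boot}"
    ls "$boot_dir" 2>/dev/null | sed -n 's/^vmlinuz-//p' | sort -V | tail -n 1
}

running="$(uname -r)"
newest="$(newest_installed_kernel)"

if [ -z "$newest" ]; then
    echo "No vmlinuz-* images found in /boot; check manually"
elif [ "$running" = "$newest" ]; then
    echo "Running the newest installed kernel: $running"
else
    echo "Running $running but $newest is installed; reboot before building"
fi
```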

Now you can install vendor-reset like so:

# Get latest Proxmox kernel headers:
apt install pve-headers

# Did that fail? If so, make sure you have the Proxmox repositories set up properly: https://pve.proxmox.com/wiki/Package_Repositories

# Get required build tools:
apt install git dkms build-essential

# Perform the build:
git clone https://github.com/gnif/vendor-reset.git
cd vendor-reset
dkms install .

# Enable vendor-reset to be loaded automatically on startup:
echo "vendor-reset" >> /etc/modules
update-initramfs -u

# Reboot to load the module:
shutdown -r now
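After the reboot, it is worth checking that the module really was loaded before you start a VM. The sketch below reads /proc/modules, where the kernel lists module names with underscores (so vendor-reset shows up as vendor_reset). `module_loaded` is a hypothetical helper name, not part of vendor-reset or Proxmox.

```shell
# Hypothetical helper: succeed if the named module appears in a /proc/modules-style file.
# Hyphens are converted to underscores, which is how the kernel lists module names.
module_loaded() {
    name="$(printf '%s' "$1" | tr '-' '_')"
    modules_file="${2:-/proc/modules}"
    grep -q "^$name " "$modules_file" 2>/dev/null
}

if module_loaded vendor-reset; then
    echo "vendor-reset is loaded"
else
    echo "vendor-reset is NOT loaded; check /etc/modules and the dkms build"
fi

# If dkms is available, also show whether the module was built for this kernel.
if command -v dkms >/dev/null 2>&1; then
    dkms status
fi
```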

Now when you start a VM that uses an AMD GPU, you’ll see messages like this appear in your dmesg output, showing that the new reset procedure is being used:

vfio-pci 0000:03:00.0: AMD_POLARIS10: version 1.0
vfio-pci 0000:03:00.0: AMD_POLARIS10: performing pre-reset
vfio-pci 0000:03:00.0: AMD_POLARIS10: performing reset
vfio-pci 0000:03:00.0: AMD_POLARIS10: GPU pci config reset
vfio-pci 0000:03:00.0: AMD_POLARIS10: performing post-reset
vfio-pci 0000:03:00.0: AMD_POLARIS10: reset result = 0
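If you don’t want to read through dmesg by hand, you can filter for the final "reset result" line. This sketch assumes the message format shown above, and it assumes that any result other than 0 means the reset failed; I have only seen the "= 0" case in the output above. `vendor_reset_ok` is a hypothetical helper name.

```shell
# Hypothetical helper: read kernel log text on stdin and succeed only if at least
# one "reset result = N" line is present and every such line reports 0.
# Treating a non-zero result as failure is an assumption.
vendor_reset_ok() {
    awk '/reset result = / { seen = 1; if ($NF != 0) bad = 1 }
         END { exit ((seen && !bad) ? 0 : 1) }'
}

if dmesg 2>/dev/null | vendor_reset_ok; then
    echo "vendor-reset reported a successful reset"
else
    echo "no successful vendor-reset result found in dmesg"
fi
```

Run it after starting (or restarting) the VM, since the messages only appear once vfio-pci actually resets the card.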

Unfortunately, with my RX 580, this module didn’t solve the reset issue for me, at least with macOS guests. However, for a number of newer AMD GPUs, vendor-reset is the answer to your prayers!

Before adding vendor-reset, I got errors like this from the PCIe root port the second time the card was initialized by a macOS guest:

pcieport 0000:00:02.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:02.0
pcieport 0000:00:02.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
pcieport 0000:00:02.0: AER: device [8086:3c04] error status/mask=00004000/00000000
pcieport 0000:00:02.0: AER: [14] CmpltTO (First)
pcieport 0000:00:02.0: AER: Device recovery successful

After that, the host’s kernel threads started reporting soft lockups until the whole host went down.

Now I no longer see those AER messages, but I still get “DMAR: DRHD: handling fault status reg 40”, followed by soft lockups that kill the host.
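To see which of these symptoms your own host is hitting, you can filter the kernel log for the three signatures mentioned here: uncorrected AER errors or completion timeouts, the DMAR fault message, and soft lockups. This is a rough sketch; `host_fault_lines` is a hypothetical helper name, and the patterns are taken from the messages quoted in this post, so other hardware may log them differently.

```shell
# Hypothetical helper: read kernel log text on stdin and print only the lines
# matching the failure signatures quoted in this post.
host_fault_lines() {
    grep -E 'AER: .*(Uncorrected|CmpltTO)|DMAR: DRHD: handling fault status|soft lockup'
}

dmesg 2>/dev/null | host_fault_lines || echo "no matching fault lines found"
```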

One thought on “Working around the AMD GPU Reset bug on Proxmox using vendor-reset”

  1. Saw this on Level1techs forum, and I had a feeling you would post about it too. I am using a WX5100 in my server, and that has been quite patchy as the reset bug sometimes affected it, sometimes it didn’t. Since I have installed this module though, it has been working properly every time the guest OS resets or shuts down.
    Very happy with the current behaviour.

    Cheers!
