Working around the AMD GPU Reset bug on Proxmox using vendor-reset

Most modern AMD GPUs suffer from the AMD Reset Bug: the card cannot be reset properly, so it can only be used once per host power-on. The second time a VM tries to use the card, Linux will attempt to reset it and fail, causing the VM launch to fail, or the guest, host or both to hang.

This is especially a problem if you only have one GPU in your system, because it will be your primary GPU and so be initialised by the host UEFI during boot, rendering it unusable for passthrough even a single time.

gnif’s new vendor-reset project is an attempt to work around this AMD reset issue by replacing AMD’s missing FLR support with vendor-specific reset quirks.

The current lineup of supported GPUs includes various Polaris, Vega and Navi models, including GPUs in the same series as the RX 480, RX 540, RX 580, Vega 56/64, Radeon VII, 5500XT, 5700XT, Pro 5600M (see the repo for the full list of supported chipsets).

Installing vendor-reset on Proxmox

First, update the kernel to the latest version and reboot (i.e. apt update && apt dist-upgrade). Otherwise the kernel headers fetched by pve-headers won’t match the currently-running kernel, and dkms will fail to build the package.
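As a quick pre-flight check before building, something like the following can confirm that headers for the running kernel are actually present. This is a minimal sketch: the function name `headers_present` is made up for illustration, and it assumes the headers package provides the usual `/lib/modules/<release>/build` link that dkms looks for.

```shell
# headers_present RELEASE [MODULES_ROOT]
# Succeeds if a kernel build tree exists for RELEASE.
# MODULES_ROOT defaults to /lib/modules (overridable for testing).
headers_present() {
    release=$1
    root=${2:-/lib/modules}
    # -d follows symlinks, so a dangling "build" link counts as missing.
    [ -d "$root/$release/build" ]
}

# Example: check the currently running kernel.
# headers_present "$(uname -r)" || echo "headers missing for $(uname -r)"
```

If the check fails right after an upgrade, the most likely cause is that the host has not yet been rebooted into the new kernel.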

Now you can install vendor-reset like so:

# Get latest Proxmox kernel headers:
apt install pve-headers

# Did that fail? If so make sure you have Proxmox repository set up properly! https://pve.proxmox.com/wiki/Package_Repositories

# Get required build tools:
apt install git dkms build-essential

# Perform the build:
git clone https://github.com/gnif/vendor-reset.git
cd vendor-reset
dkms install .

# Enable vendor-reset to be loaded automatically on startup:
echo "vendor-reset" >> /etc/modules
update-initramfs -u

# Reboot to load the module:
shutdown -r now
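
After the reboot, one way to confirm that the module really was loaded is to look for it in the kernel's module list. The helper below is a hypothetical sketch (the name `module_loaded` is not part of vendor-reset); it relies on the fact that the kernel lists modules with underscores, so `vendor-reset` shows up as `vendor_reset`.

```shell
# module_loaded NAME [PROC_MODULES]
# Succeeds if NAME appears in the list of loaded kernel modules.
# PROC_MODULES defaults to /proc/modules (overridable for testing).
module_loaded() {
    # The kernel reports module names with '_' in place of '-'.
    name=$(printf '%s' "$1" | tr '-' '_')
    list=${2:-/proc/modules}
    grep -q "^$name " "$list"
}

# Example:
# module_loaded vendor-reset || echo "vendor-reset is not loaded"
```

Running `lsmod | grep vendor_reset` by hand gives the same information.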

Now when you start a VM that uses an AMD GPU, you’ll see messages like this appear in your dmesg output, showing that the new reset procedure is being used:

vfio-pci 0000:03:00.0: AMD_POLARIS10: version 1.0
vfio-pci 0000:03:00.0: AMD_POLARIS10: performing pre-reset
vfio-pci 0000:03:00.0: AMD_POLARIS10: performing reset
vfio-pci 0000:03:00.0: AMD_POLARIS10: GPU pci config reset
vfio-pci 0000:03:00.0: AMD_POLARIS10: performing post-reset
vfio-pci 0000:03:00.0: AMD_POLARIS10: reset result = 0
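
To pull just the outcome out of a long dmesg, a small filter along these lines may help. It is a sketch that assumes the log lines keep the format shown above; `reset_results` is an illustrative name, and reading a result of 0 as success follows the usual kernel convention rather than anything stated by the vendor-reset project here.

```shell
# reset_results: read kernel log lines on stdin and print
# "<pci-address> <result>" for each vendor-reset result line.
reset_results() {
    sed -n 's/.*vfio-pci \([0-9a-f:.]*\): .*reset result = \(-\{0,1\}[0-9]*\).*/\1 \2/p'
}

# Example:
# dmesg | reset_results
```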

Unfortunately, with my RX 580, this module didn’t solve the reset issue for me, at least with macOS guests. However on a bunch of newer AMD GPUs, vendor-reset is the answer to your prayers!

(Before adding vendor-reset, I got errors like this reported from the PCIe root port the second time the card was initialised by a macOS guest:)

pcieport 0000:00:02.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:02.0
pcieport 0000:00:02.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
pcieport 0000:00:02.0: AER: device [8086:3c04] error status/mask=00004000/00000000
pcieport 0000:00:02.0: AER: [14] CmpltTO (First)
pcieport 0000:00:02.0: AER: Device recovery successful

After that the host’s kernel threads started reporting soft lockups until the whole host was brought down.

Now I no longer see those AER messages, but I still get “DMAR: DRHD: handling fault status reg 40”, followed by soft lockups that kill the host.
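
If you want to watch for these two failure signatures without reading the whole log, a rough counter like the one below could be used. It is only a sketch: the function name is hypothetical and the patterns are copied from the messages quoted in this post, so other hardware may word its errors differently.

```shell
# count_passthrough_faults: read kernel log lines on stdin and print how
# many match the AER and DMAR failure signatures quoted in this post.
count_passthrough_faults() {
    # grep -c exits non-zero when the count is 0; "|| true" hides that.
    grep -c -e 'AER: Uncorrected' -e 'DMAR: DRHD: handling fault status' || true
}

# Example:
# dmesg | count_passthrough_faults
```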

91 thoughts on “Working around the AMD GPU Reset bug on Proxmox using vendor-reset”

  1. Saw this on Level1techs forum, and I had a feeling you would post about it too. I am using a WX5100 in my server, and that has been quite patchy as the reset bug sometimes affected it, sometimes it didn’t. Since I have installed this module though, it has been working properly every time the guest OS resets or shuts down.
    Very happy with the current behaviour.

    Cheers!

  2. Having a problem at this step:

    $ dkms install .

    Creating symlink /var/lib/dkms/vendor-reset/0.0.18/source ->
    /usr/src/vendor-reset-0.0.18

    DKMS: add completed.
    Error! Your kernel headers for kernel 5.4.34-1-pve cannot be found.
    Please install the linux-headers-5.4.34-1-pve package,
    or use the --kernelsourcedir option to tell DKMS where it’s located

    All steps prior to that were done. Well, I’m not sure about this one since you’re not explicit about which repos are needed.

    # Get latest Proxmox kernel headers:
    apt install pve-headers

    # Did that fail? If so make sure you have Proxmox repository set up properly! https://pve.proxmox.com/wiki/Package_Repositories

    Yes, it failed and I added the following to /etc/apt/sources.list:
    # PVE pve-no-subscription repository provided by proxmox.com,
    # NOT recommended for production use
    deb http://download.proxmox.com/debian/pve buster pve-no-subscription

    Any ideas?

    I have to say, your guides are amazingly useful. You have no idea how grateful I am to have someone to learn from.

    1. After adding that repo you’ll need to run apt update to fetch it, then you’ll want to apt dist-upgrade to get all the Proxmox updates you’ve been missing out on. Then you can apt install pve-headers, before you can run dkms.

      If during your upgrades you end up installing a new kernel, the headers on disk won’t match the kernel that is currently running until you reboot Proxmox, so give Proxmox a reboot for good luck.

    2. I ran

      dkms install .

      and had the wrong headers.

      Had to run

      apt install pve-headers-$(uname -r)

      and then

      dkms install vendor-reset/0.0.18

      in the vendor-reset directory

      1. Yep that’s why I said to upgrade the kernel and restart first. Make sure you do also have “pve-headers” installed, because otherwise the next time the kernel is updated dkms will fail to rebuild the package for you, since the new headers won’t be downloaded for you automatically. (pve-headers is a metapackage that points to the newest headers)

      2. Thanks for your comment.
        Additionally, I had to clean up the old installation first. Here are some helpful commands:
        dpkg status
        dpkg --remove --force vendor-reset/<>
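
The reboot advice in this thread can be checked mechanically: if a newer kernel than the running one is installed, the host should be rebooted before building. The sketch below assumes GNU `sort -V` is available (it is on Debian-based systems) and that the host boots the newest installed kernel; both function names are invented for illustration.

```shell
# newest_kernel [MODULES_ROOT]
# Print the highest-versioned directory name under MODULES_ROOT
# (default /lib/modules), using GNU sort's version ordering.
newest_kernel() {
    ls "${1:-/lib/modules}" | sort -V | tail -n 1
}

# reboot_pending RUNNING [MODULES_ROOT]
# Succeeds if the newest installed kernel differs from RUNNING,
# i.e. a reboot is needed before running dkms.
reboot_pending() {
    [ "$(newest_kernel "${2:-/lib/modules}")" != "$1" ]
}

# Example:
# reboot_pending "$(uname -r)" && echo "reboot before building"
```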

  3. First off… Thank You Nick for this post. I was able to apply the vendor-reset to my proxmox server and now I can power on and off my VM with my AMD RX570 4G passed through at will. I use this for hardware encoding of video streams from my Emby (like Plex) Media Server VM. I also have 2 other VMs with the NVIDIA GTX 1050ti and Intel QuickSync (both passed through) also for Emby Media Server.

    I would also like to give my eternal gratitude to gnif for his work on this as well!

  4. Like you I still have issues getting the RX 580 to work consistently during passthrough, even with vendor-reset installed. I run into issues when sleeping on Windows VMs (which I try to avoid with any KVM) and switching between macOS and Windows. Any experiences there?

    Should probably wait for an RDNA 2 based lower end GPU or install a GTX 970, but that means not being able to use anything higher than High Sierra.

    1. I haven’t tried sleeping on any of my VMs. But yeah I have the same issues as you switching between mac and Windows. Big Navi is sounding interesting for a future purchase, maybe in a couple of generations once they’re cheap, lol

  5. Nick, thanks for all the posts. Have you tried passing in your GPU VBIOS via libvirt/qemu? This helped for resetting my 5700 XT via vendor-reset (even though there was nothing wrong with the identical VBIOS already stored on the card). Good luck with your rig… it is so nice to finally be able to reset reasonably well!

    1. No, I haven’t tried that one, it may be worth a go…

      Did you download a BIOS from the web or just dump your current one?

      1. Either way will work. You can see your vBIOS firmware version in GPU-z under windows. You can dump it there, or with the AMD linux utility that is floating around. Or you could just grab the version number and download it from the techpowerup website.

        FWIW, I do not need to do this hack anymore. It was fixed for me by some combination of updating my motherboard firmware and/or updating to the latest vendor-reset build.

        Thanks again for all the write-ups!

    2. I assume Nick would have reported back if he made this work, but I decided to give it a go anyway and leave a comment here for anyone else looking through this thread. I have vendor-reset installed under Proxmox 6.3-6 but still have problems with my 8GB Sapphire RX 580 (SKU 11265-05-20G)

      I dumped the vBIOS with GPU-Z in a Windows 10 vm with the GPU passed through, placed it in /usr/share/kvm, and appended “,romfile=rx580_vbios.rom” to my “hostpci0:” line in /etc/pve/qemu-server/[vmid].conf

      Unfortunately that didn’t result in any change of behavior:
      * I am able to reboot, shutdown and power on windows 10 vm with the GPU passed through without any problems.
      * I am able to boot macOS Big Sur once, but any attempt to reboot or power on after shutdown results in the dreaded “DMAR: DRHD: handling fault status reg 40” followed by lockups that finally leave the host unresponsive.

      Giving up for now.
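
For anyone repeating the romfile experiment described in this thread, the config edit can be scripted roughly as follows. This is a sketch under a few assumptions: the helper name `add_romfile` is made up, the VM has a single `hostpci0:` line, the ROM file name contains no `|` or `&` characters, and GNU `sed -i` is available. Try it on a copy of the config file first.

```shell
# add_romfile CONF ROMFILE
# Append ",romfile=ROMFILE" to the hostpci0 line of a VM config file,
# unless that line already carries a romfile option.
add_romfile() {
    conf=$1
    rom=$2
    if grep '^hostpci0:' "$conf" | grep -q 'romfile='; then
        return 0    # already set, leave the file untouched
    fi
    sed -i "s|^\(hostpci0:.*\)\$|\1,romfile=$rom|" "$conf"
}

# Example (on a copy of the config):
# add_romfile ./vm.conf.copy rx580_vbios.rom
```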

  6. Hi,

    first of all thanks for your great work!

    After following all the steps in this tutorial:
    # Enable vendor-reset to be loaded automatically on startup:
    echo "vendor-reset" >> /etc/modules
    update-initramfs -u

    I get this error in the terminal:
    Running hook script ‘zz-pve-efiboot’..
    Re-executing ‘/etc/kernel/postinst.d/zz-pve-efiboot’ in new private mount namespace..
    No /etc/kernel/pve-efiboot-uuids found, skipping ESP sync.

    I don’t know what this means or how to fix it. After rebooting I’m unable to boot into macOS. I can see the Proxmox boot logo on screen, and I can choose the disk I want to boot. But after selecting it there is no Apple logo, just a black screen, and no error in Proxmox itself.

    Would be glad if anyone could help me with this.

    1. That one’s not an error, you can ignore that message. Which GPU model are you using, and did you make sure to set vga to none to disable the emulated video?

  7. Thanks for the fast reply!

    Ok, didn’t know that!

    I’m using AMD Radeon VII (Vega 20). I set the display to none.

    I get a similar error within Windows 10 where I can’t install any AMD drivers.

    I set up the GPU forwarding like in one of your tutorials.

    1. BTW. in Proxmox I get this error:
      TASK ERROR: start failed: command ‘/usr/bin/kvm -id 200 -name Catalina… +invtsc” failed: got timeout

      1. You can get timeouts like this due to a delay in allocating ram, or it could be a failure to reset the card. Check dmesg for card reset errors.

        If it’s just caused by ram allocation delays, start the VM like this to bypass the timeout “qm showcmd 200 | bash”

    2. I would work on solving the Windows VM problem first since it’s the easiest platform.

      Can you show the hostpci lines from your VM config file?

      1. This is the config file from the Win 10 VM:

        agent: 1
        bios: ovmf
        boot: order=scsi0;net0
        cores: 8
        cpu: kvm64,flags=+pdpe1gb
        efidisk0: local-lvm:vm-101-disk-1,size=4M
        hostpci0: 03:00,pcie=1,x-vga=1
        localtime: 1
        machine: q35
        memory: 16384
        name: win10pro
        net0: e1000=FA:9A:B7:FA:57:65,bridge=vmbr0,firewall=1
        numa: 1
        ostype: win10
        scsi0: local-lvm:vm-101-disk-0,cache=writeback,size=60G
        scsi1: /dev/disk/by-id/ata-ST2000DX002-2DV164_Z4ZCSG09,size=1953514584K
        scsihw: virtio-scsi-pci
        smbios1: uuid=dcafd7eb-840b-446c-a220-6a4bef0ad214
        sockets: 1
        usb0: host=1-7.1.3.2,usb3=1
        usb1: host=1-7.1.1,usb3=1
        vga: none
        vmgenid: 371ef0b1-f007-453a-b18d-7d40bea27ccc

        Do you need a list of all PCI devices?

        1. That is the GPU in the PCI list:

          01:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] Device 14a0 (rev c1)
          02:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] Device 14a1
          03:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Vega 20 [Radeon VII] (rev c1)
          03:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Vega 20 HDMI Audio [Radeon VII]

          1. Not in Proxmox. In Windows, this:
            After every restart my AMD GPU driver installation is completely gone. I can install AMD Adrenalin, and at the end it says the installation is complete and asks whether I want to launch the application or restart the system. Launching the application gives me the error that no GPU is recognized, and rebooting brings me back to the error mentioned before.

            1. Can you show the dmesg output that shows the card being reset successfully then? vendor-reset prints a bunch of messages during reset.

              1. Sorry for that, that seems to be a lot of code:

                [ 38.048119] i915 0000:00:02.0: enabling device (0000 -> 0003)
                [ 38.048711] i915 0000:00:02.0: VT-d active for gfx access
                [ 38.048712] checking generic (6030000000 1300000) vs hw (4000000000 10000000)
                [ 38.049705] [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
                [ 38.049706] [drm] Driver supports precise vblank timestamp query.
                [ 38.051171] iwlwifi 0000:00:14.3: Detected Intel(R) Dual Band Wireless AC 9560, REV=0x318
                [ 38.051516] snd_hda_intel 0000:03:00.1: Handle vga_switcheroo audio client
                [ 38.051765] [drm] Finished loading DMC firmware i915/kbl_dmc_ver1_04.bin (v1.4)
                [ 38.059260] iwlwifi 0000:00:14.3: Applying debug destination EXTERNAL_DRAM
                [ 38.059554] iwlwifi 0000:00:14.3: Allocated 0x00400000 bytes for firmware monitor.
                [ 38.064592] [drm] amdgpu kernel modesetting enabled.
                [ 38.064761] amdgpu 0000:03:00.0: remove_conflicting_pci_framebuffers: bar 0: 0x6030000000 -> 0x603fffffff
                [ 38.064763] amdgpu 0000:03:00.0: remove_conflicting_pci_framebuffers: bar 2: 0x6040000000 -> 0x60401fffff
                [ 38.064764] amdgpu 0000:03:00.0: remove_conflicting_pci_framebuffers: bar 5: 0x56100000 -> 0x5617ffff
                [ 38.064765] checking generic (6030000000 1300000) vs hw (6030000000 10000000)
                [ 38.064766] fb0: switching to amdgpudrmfb from EFI VGA
                [ 38.077766] Console: switching to colour dummy device 80x25
                [ 38.077789] amdgpu 0000:03:00.0: vgaarb: deactivate vga console
                [ 38.077827] amdgpu 0000:03:00.0: enabling device (0006 -> 0007)
                [ 38.077988] [drm] initializing kernel modesetting (VEGA20 0x1002:0x66AF 0x1002:0x081E 0xC1).
                [ 38.077999] [drm] register mmio base: 0x56100000
                [ 38.078000] [drm] register mmio size: 524288
                [ 38.078009] [drm] add ip block number 0
                [ 38.078010] [drm] add ip block number 1
                [ 38.078010] [drm] add ip block number 2
                [ 38.078011] [drm] add ip block number 3
                [ 38.078011] [drm] add ip block number 4
                [ 38.078012] [drm] add ip block number 5
                [ 38.078012] [drm] add ip block number 6
                [ 38.078013] [drm] add ip block number 7
                [ 38.078013] [drm] add ip block number 8
                [ 38.078014] [drm] add ip block number 9
                [ 38.078031] ATOM BIOS: 113-D3600200-106
                [ 38.078812] [drm] UVD(0) is enabled in VM mode
                [ 38.078813] [drm] UVD(1) is enabled in VM mode
                [ 38.078813] [drm] UVD(0) ENC is enabled in VM mode
                [ 38.078813] [drm] UVD(1) ENC is enabled in VM mode
                [ 38.078814] [drm] VCE enabled in VM mode
                [ 38.078844] [drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
                [ 38.078853] amdgpu 0000:03:00.0: BAR 2: releasing [mem 0x6040000000-0x60401fffff 64bit pref]
                [ 38.078855] amdgpu 0000:03:00.0: BAR 0: releasing [mem 0x6030000000-0x603fffffff 64bit pref]
                [ 38.078868] pcieport 0000:02:00.0: BAR 15: releasing [mem 0x6030000000-0x60401fffff 64bit pref]
                [ 38.078869] pcieport 0000:01:00.0: BAR 15: releasing [mem 0x6030000000-0x60401fffff 64bit pref]
                [ 38.078871] pcieport 0000:00:01.0: BAR 15: releasing [mem 0x6030000000-0x60401fffff 64bit pref]
                [ 38.078879] pcieport 0000:00:01.0: BAR 15: assigned [mem 0x4200000000-0x47ffffffff 64bit pref]
                [ 38.078880] pcieport 0000:01:00.0: BAR 15: assigned [mem 0x4200000000-0x47ffffffff 64bit pref]
                [ 38.078882] pcieport 0000:02:00.0: BAR 15: assigned [mem 0x4200000000-0x47ffffffff 64bit pref]
                [ 38.078884] amdgpu 0000:03:00.0: BAR 0: assigned [mem 0x4400000000-0x47ffffffff 64bit pref]
                [ 38.078890] amdgpu 0000:03:00.0: BAR 2: assigned [mem 0x4200000000-0x42001fffff 64bit pref]
                [ 38.078897] pcieport 0000:00:01.0: PCI bridge to [bus 01-03]
                [ 38.078898] pcieport 0000:00:01.0: bridge window [io 0x5000-0x5fff]
                [ 38.078900] pcieport 0000:00:01.0: bridge window [mem 0x56100000-0x562fffff]
                [ 38.078902] pcieport 0000:00:01.0: bridge window [mem 0x4200000000-0x47ffffffff 64bit pref]
                [ 38.078905] pcieport 0000:01:00.0: PCI bridge to [bus 02-03]
                [ 38.078906] pcieport 0000:01:00.0: bridge window [io 0x5000-0x5fff]
                [ 38.078909] pcieport 0000:01:00.0: bridge window [mem 0x56100000-0x561fffff]
                [ 38.078912] pcieport 0000:01:00.0: bridge window [mem 0x4200000000-0x47ffffffff 64bit pref]
                [ 38.078915] pcieport 0000:02:00.0: PCI bridge to [bus 03]
                [ 38.078917] pcieport 0000:02:00.0: bridge window [io 0x5000-0x5fff]
                [ 38.078920] pcieport 0000:02:00.0: bridge window [mem 0x56100000-0x561fffff]
                [ 38.078923] pcieport 0000:02:00.0: bridge window [mem 0x4200000000-0x47ffffffff 64bit pref]
                [ 38.078933] amdgpu 0000:03:00.0: VRAM: 16368M 0x0000008000000000 - 0x00000083FEFFFFFF (16368M used)
                [ 38.078934] amdgpu 0000:03:00.0: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
                [ 38.078936] amdgpu 0000:03:00.0: AGP: 267894784M 0x0000008400000000 - 0x0000FFFFFFFFFFFF
                [ 38.078941] [drm] Detected VRAM RAM=16368M, BAR=16384M
                [ 38.078942] [drm] RAM width 4096bits HBM
                [ 38.079026] [drm] amdgpu: 16368M of VRAM memory ready
                [ 38.079029] [drm] amdgpu: 16368M of GTT memory ready.
                [ 38.079038] [drm] GART: num cpu pages 131072, num gpu pages 131072
                [ 38.079115] [drm] PCIE GART of 512M enabled (table at 0x00000080012E6000).
                [ 38.084142] [drm] use_doorbell being set to: [true]
                [ 38.084887] [drm] use_doorbell being set to: [true]
                [ 38.084937] amdgpu: [powerplay] hwmgr_sw_init smu backed is vega20_smu
                [ 38.086660] [drm] Found UVD firmware ENC: 1.2 DEC: .43 Family ID: 19
                [ 38.086663] [drm] PSP loading UVD firmware
                [ 38.087844] [drm] Found VCE firmware Version: 57.6 Binary ID: 4
                [ 38.087865] [drm] PSP loading VCE firmware
                [ 38.099752] iwlwifi 0000:00:14.3: base HW address: 04:ea:56:b4:04:f6
                [ 38.165751] ieee80211 phy0: Selected rate control algorithm 'iwl-mvm-rs'
                [ 38.165970] thermal thermal_zone3: failed to read out thermal zone (-61)
                [ 38.167056] iwlwifi 0000:00:14.3 wlo1: renamed from wlan0
                [ 38.330188] [drm] failed to retrieve link info, disabling eDP
                [ 38.514118] [drm] reserve 0x400000 from 0x83fe800000 for PSP TMR
                [ 38.554564] RAPL PMU: API unit is 2^-32 Joules, 3 fixed counters, 655360 ms ovfl timer
                [ 38.554565] RAPL PMU: hw unit of domain pp0-core 2^-14 Joules
                [ 38.554565] RAPL PMU: hw unit of domain package 2^-14 Joules
                [ 38.554565] RAPL PMU: hw unit of domain pp1-gpu 2^-14 Joules
                [ 38.557387] cryptd: max_cpu_qlen set to 1000
                [ 38.561196] AVX2 version of gcm_enc/dec engaged.
                [ 38.561197] AES CTR mode by8 optimization enabled
                [ 38.589962] Adding 8388604k swap on /dev/mapper/pve-swap. Priority:-2 extents:1 across:8388604k SSFS
                [ 38.699337] intel_rapl_common: Found RAPL domain package
                [ 38.699338] intel_rapl_common: Found RAPL domain core
                [ 38.699339] intel_rapl_common: Found RAPL domain uncore
                [ 38.737563] [drm] psp command failed and response status is (0x100)
                [ 38.849601] [drm] Display Core initialized with v3.2.48!
                [ 38.849675] snd_hda_intel 0000:03:00.1: bound 0000:03:00.0 (ops amdgpu_dm_audio_component_bind_ops [amdgpu])
                [ 39.020273] [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
                [ 39.020273] [drm] Driver supports precise vblank timestamp query.
                [ 39.062156] [drm] UVD and UVD ENC initialized successfully.
                [ 39.261137] [drm] VCE initialized successfully.
                [ 39.262472] kfd kfd: Allocated 3969056 bytes on gart
                [ 39.263142] Virtual CRAT table created for GPU
                [ 39.263142] Parsing CRAT table with 1 nodes
                [ 39.263149] Creating topology SYSFS entries
                [ 39.263222] Topology: Add dGPU node [0x66af:0x1002]
                [ 39.263223] kfd kfd: added device 1002:66af
                [ 39.266215] [drm] fb mappable at 0x440193B000
                [ 39.266215] [drm] vram apper at 0x4400000000
                [ 39.266216] [drm] size 19906560
                [ 39.266216] [drm] fb depth is 24
                [ 39.266216] [drm] pitch is 13824
                [ 39.266341] fbcon: amdgpudrmfb (fb0) is primary device
                [ 39.392016] Console: switching to colour frame buffer device 240x67
                [ 39.410359] amdgpu 0000:03:00.0: fb0: amdgpudrmfb frame buffer device
                [ 39.721624] amdgpu 0000:03:00.0: ring gfx uses VM inv eng 0 on hub 0
                [ 39.721625] amdgpu 0000:03:00.0: ring comp_1.0.0 uses VM inv eng 1 on hub 0
                [ 39.721625] amdgpu 0000:03:00.0: ring comp_1.1.0 uses VM inv eng 4 on hub 0
                [ 39.721626] amdgpu 0000:03:00.0: ring comp_1.2.0 uses VM inv eng 5 on hub 0
                [ 39.721626] amdgpu 0000:03:00.0: ring comp_1.3.0 uses VM inv eng 6 on hub 0
                [ 39.721627] amdgpu 0000:03:00.0: ring comp_1.0.1 uses VM inv eng 7 on hub 0
                [ 39.721627] amdgpu 0000:03:00.0: ring comp_1.1.1 uses VM inv eng 8 on hub 0
                [ 39.721627] amdgpu 0000:03:00.0: ring comp_1.2.1 uses VM inv eng 9 on hub 0
                [ 39.721628] amdgpu 0000:03:00.0: ring comp_1.3.1 uses VM inv eng 10 on hub 0
                [ 39.721628] amdgpu 0000:03:00.0: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
                [ 39.721629] amdgpu 0000:03:00.0: ring sdma0 uses VM inv eng 0 on hub 1
                [ 39.721629] amdgpu 0000:03:00.0: ring page0 uses VM inv eng 1 on hub 1
                [ 39.721630] amdgpu 0000:03:00.0: ring sdma1 uses VM inv eng 4 on hub 1
                [ 39.721630] amdgpu 0000:03:00.0: ring page1 uses VM inv eng 5 on hub 1
                [ 39.721631] amdgpu 0000:03:00.0: ring uvd_0 uses VM inv eng 6 on hub 1
                [ 39.721631] amdgpu 0000:03:00.0: ring uvd_enc_0.0 uses VM inv eng 7 on hub 1
                [ 39.721632] amdgpu 0000:03:00.0: ring uvd_enc_0.1 uses VM inv eng 8 on hub 1
                [ 39.721632] amdgpu 0000:03:00.0: ring uvd_1 uses VM inv eng 9 on hub 1
                [ 39.721633] amdgpu 0000:03:00.0: ring uvd_enc_1.0 uses VM inv eng 10 on hub 1
                [ 39.721633] amdgpu 0000:03:00.0: ring uvd_enc_1.1 uses VM inv eng 11 on hub 1
                [ 39.721634] amdgpu 0000:03:00.0: ring vce0 uses VM inv eng 12 on hub 1
                [ 39.721634] amdgpu 0000:03:00.0: ring vce1 uses VM inv eng 13 on hub 1
                [ 39.721634] amdgpu 0000:03:00.0: ring vce2 uses VM inv eng 14 on hub 1
                [ 39.721635] [drm] ECC is not present.
                [ 39.721635] [drm] SRAM ECC is not present.
                [ 39.722023] Detected AMDGPU DF Counters. # of Counters = 4.
                [ 39.722036] [drm] Initialized amdgpu 3.35.0 20150101 for 0000:03:00.0 on minor 1
                [ 39.722566] [drm] Initialized i915 1.6.0 20190822 for 0000:00:02.0 on minor 0
                [ 39.722996] snd_hda_intel 0000:00:1f.3: bound 0000:00:02.0 (ops i915_audio_component_bind_ops [i915])
                [ 39.758385] [drm] Cannot find any crtc or sizes
                [ 39.790084] [drm] Cannot find any crtc or sizes
                [ 39.820200] [drm] Cannot find any crtc or sizes
                [ 2216.103474] [drm:amdgpu_pci_remove [amdgpu]] *ERROR* Device removal is currently not supported outside of fbcon
                [ 2216.103980] [drm] amdgpu: finishing device.
                [ 2216.223161] Console: switching to colour dummy device 80x25
                [ 2216.383460] [drm] amdgpu: ttm finalized
                [ 2216.402754] vfio-pci 0000:03:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
                [ 2218.399051] vfio-pci 0000:03:00.0: vfio_ecap_init: hiding ecap 0x19@0x270
                [ 2218.399059] vfio-pci 0000:03:00.0: vfio_ecap_init: hiding ecap 0x1b@0x2d0
                [ 2237.830880] usb 1-7.1.1: reset full-speed USB device number 9 using xhci_hcd
                [ 2238.218853] usb 1-7.1.3.2: reset low-speed USB device number 13 using xhci_hcd
                [ 2594.687073] vfio-pci 0000:03:00.1: Refused to change power state, currently in D0
                [ 2599.275450] vfio-pci 0000:03:00.1: vfio_bar_restore: reset recovery - restoring BARs
                [ 2599.291422] vfio-pci 0000:03:00.0: vfio_bar_restore: reset recovery - restoring BARs
                [ 2600.544777] vfio-pci 0000:03:00.0: vfio_bar_restore: reset recovery - restoring BARs
                [ 2600.544925] vfio-pci 0000:03:00.1: vfio_bar_restore: reset recovery - restoring BARs
                [ 2600.560860] vfio-pci 0000:03:00.0: vfio_bar_restore: reset recovery - restoring BARs
                [ 2600.561006] vfio-pci 0000:03:00.1: vfio_bar_restore: reset recovery - restoring BARs
                [ 2600.573161] vfio-pci 0000:03:00.0: vfio_bar_restore: reset recovery - restoring BARs
                [ 2600.573309] vfio-pci 0000:03:00.1: vfio_bar_restore: reset recovery - restoring BARs
                [ 2600.587327] vfio-pci 0000:03:00.0: vfio_bar_restore: reset recovery - restoring BARs
                [ 2601.136257] vfio-pci 0000:03:00.0: vfio_bar_restore: reset recovery - restoring BARs
                [ 2601.231596] vfio-pci 0000:03:00.0: vfio_bar_restore: reset recovery - restoring BARs
                [ 2601.231718] vfio-pci 0000:03:00.0: vfio_bar_restore: reset recovery - restoring BARs
                [ 2601.231729] vfio-pci 0000:03:00.0: vfio_bar_restore: reset recovery - restoring BARs
                [ 2601.231745] vfio-pci 0000:03:00.1: vfio_bar_restore: reset recovery - restoring BARs
                [ 2601.231825] vfio-pci 0000:03:00.1: vfio_bar_restore: reset recovery - restoring BARs
                [ 2601.231836] vfio-pci 0000:03:00.1: vfio_bar_restore: reset recovery - restoring BARs
                [ 2601.231906] vfio-pci 0000:03:00.0: vfio_bar_restore: reset recovery - restoring BARs
                [ 2601.231917] vfio-pci 0000:03:00.1: vfio_bar_restore: reset recovery - restoring BARs
                [ 2601.503789] vfio-pci 0000:03:00.0: vfio_bar_restore: reset recovery - restoring BARs
                [ 2601.503888] vfio-pci 0000:03:00.1: vfio_bar_restore: reset recovery - restoring BARs
                [ 2601.510840] vfio-pci 0000:03:00.0: vfio_bar_restore: reset recovery - restoring BARs
                [ 2601.510946] vfio-pci 0000:03:00.0: vfio_bar_restore: reset recovery - restoring BARs
                [ 2601.510958] vfio-pci 0000:03:00.1: vfio_bar_restore: reset recovery - restoring BARs
                [ 2601.511062] vfio-pci 0000:03:00.1: vfio_bar_restore: reset recovery - restoring BARs
                [ 2601.522300] vfio-pci 0000:03:00.0: vfio_bar_restore: reset recovery - restoring BARs
                [ 2601.522429] vfio-pci 0000:03:00.1: vfio_bar_restore: reset recovery - restoring BARs
                [ 2601.612251] vfio-pci 0000:03:00.0: vfio_bar_restore: reset recovery - restoring BARs
                root@pve:~#

  8. You’ve got several problems there, vendor-reset isn’t being loaded at all, which is weird, but you also haven’t blacklisted amdgpu so that’s loading during host startup, and you don’t want that to happen either.

    Try “update-initramfs -u -k all” just in case it didn’t update the kernel you’re booting from last time.

    1. I blacklisted “amdgpu” and ran this cmd: “update-initramfs -u -k all”.

      After restarting Proxmox Windows seems to start but won’t give me a picture. Even with Anydesk where I can see that the system started I see a black screen without anything else.

      dmesg shows me this:
      [ 303.784162] vfio-pci 0000:03:00.0: BAR 0: can’t reserve [mem 0x6030000000-0x603fffffff 64bit pref]
      [ 303.784177] vfio-pci 0000:03:00.0: BAR 0: can’t reserve [mem 0x6030000000-0x603fffffff 64bit pref]
      [ 303.784186] vfio-pci 0000:03:00.0: BAR 0: can’t reserve [mem 0x6030000000-0x603fffffff 64bit pref]

      1. I think that’s because the GPU is being used for your host console. If you have another GPU, set that as the primary instead in your host UEFI settings. If you only have this one GPU, add this to your kernel arguments to disable host video “video=vesafb:off,efifb:off”. Note that you’ll want to ensure you have a way of connecting to Proxmox that doesn’t require the screen, since it’ll stop outputting video right at the start of boot.

          1. Yes, if you have an iGPU, set it as primary and you don’t need to add the video=vesafb:off,efifb:off argument.

            1. Sadly Proxmox won’t even let me boot into the bootloader of the Mac VM.

              I only get this message:
              [ 1038.500624] vfio-pci 0000:03:00.0: timed out waiting for pending transaction; performing function level reset anyway

              1. vendor-reset still isn’t being loaded. Try “modprobe vendor-reset” (reboot host first to reset the already-broken GPU)

        1. So now with the primary output changed I can boot into Windows and see what’s on the screen.

          But after another try at installing the AMD driver I got the same error message in Windows as before:
          No AMD graphics driver is installed, or the AMD driver is not functioning properly. Please install the AMD driver appropriate for your AMD hardware.

          I loaded the same (updated) GPU driver as on my natively running Windows 10 machine. So that is for sure the right one.

          1. BTW, this error message comes right after the installation was completed successfully. I can even see the AMD icon in the system tray.
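
The kernel argument suggested earlier in this thread (`video=vesafb:off,efifb:off`) has to end up on the host's kernel command line. The sketch below shows one way to add it idempotently, assuming the host boots via GRUB and keeps its arguments in the `GRUB_CMDLINE_LINUX_DEFAULT` line of a file such as `/etc/default/grub`; the helper name `add_kernel_arg` is hypothetical, and the argument must not contain `|` or `&`.

```shell
# add_kernel_arg GRUB_FILE ARG
# Append ARG to GRUB_CMDLINE_LINUX_DEFAULT in GRUB_FILE unless the
# line already contains it.
add_kernel_arg() {
    file=$1
    arg=$2
    if grep '^GRUB_CMDLINE_LINUX_DEFAULT=' "$file" | grep -qF -- "$arg"; then
        return 0    # already present
    fi
    # Insert " ARG" just before the closing double quote of the value.
    sed -i "s|^\(GRUB_CMDLINE_LINUX_DEFAULT=\"[^\"]*\)\"|\1 $arg\"|" "$file"
}

# Example:
# add_kernel_arg /etc/default/grub 'video=vesafb:off,efifb:off'
# update-grub
```

Hosts that boot through systemd-boot rather than GRUB keep their arguments in `/etc/kernel/cmdline` instead, so this helper would not apply there as written.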

  9. After using this cmd: “modprobe vendor-reset”
    I get this: FATAL: Module vendor-reset not found in directory /lib/modules/5.4.73-1-pve

          1. I installed everything in the order you listed without any complaints on the Proxmox side. I get messages that I can’t install the packages again because I already have them installed.

            The only problem I get (at least in the Proxmox install process) is the very last part:

            # Enable vendor-reset to be loaded automatically on startup:
            echo "vendor-reset" >> /etc/modules
            update-initramfs -u

            That results in this:
            Running hook script 'zz-pve-efiboot'..
            Re-executing '/etc/kernel/postinst.d/zz-pve-efiboot' in new private mount namespace..
            No /etc/kernel/pve-efiboot-uuids found, skipping ESP sync.

              1. I did this and completely re-installed Catalina.
                Now it seems to work fine!

                But in Windows I always get this weird error message:
                No AMD graphics driver is installed, or the AMD driver is not functioning properly. Please install the AMD driver appropriate for your AMD hardware.

                I think a Windows re-install would be the next step, maybe it also fixes this.

                All in all thank you very much for your patience, you helped me a lot!

  10. N00b question: How would I go about updating to the latest vendor-reset version?

    Remove the vendor-reset directory and git clone anew?

    1. Try git pull (I’ll find out whether that’s correct):

      cd into the directory that contains the .git folder. In my case this is at
      /root/vendor-reset/vendor-reset

      and from there type

      git pull origin master
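
Pulling the new source is probably only half of an update, because DKMS still has the old build registered. Below is a minimal sketch of the whole sequence, assuming the module was installed with "dkms install ." from a git checkout as described in the post, and that the checkout's dkms.conf declares the version as PACKAGE_VERSION. The function names and the /root/vendor-reset path are placeholders, not part of vendor-reset.

```shell
# Print the PACKAGE_VERSION declared in a dkms.conf file.
dkms_conf_version() {
    sed -n 's/^PACKAGE_VERSION="\{0,1\}\([^"]*\)"\{0,1\}[[:space:]]*$/\1/p' "$1"
}

# Hypothetical helper: rebuild vendor-reset from an updated checkout.
# Runs in a subshell so the caller's working directory is left alone.
update_vendor_reset() (
    cd "$1" || exit 1
    old_version=$(dkms_conf_version dkms.conf)
    git pull || exit 1
    # Drop the previously registered build (ignore the error if there is none).
    dkms remove "vendor-reset/${old_version}" --all || true
    dkms install .
)

# Usage (as root), then reboot so the new module is loaded:
#   update_vendor_reset /root/vendor-reset
```

Reading the version before pulling matters in case the update bumps it; otherwise the remove step would target a version that DKMS has never seen.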

  11. I had the same issue with the RX 580. By isolating the cores that the macOS guest is using from the host, this issue “almost” disappears (the host still crashes on about 1 in 20 guest reboots).

    1. Very interesting, thanks for that tidbit! My VM has every core I have assigned to it, so maybe this is why it crashes so frequently for me.

      1. I’ve been debugging this and banging my head against the wall for over a week now, but I’ve got pretty far.
        To get rid of “AER: Uncorrected (Non-Fatal) error received” you need to append pcie_aspm=off to your boot vars. This disables power management for PCIe; it’s the best fix so far.

        Give your host 2 cores and isolate the rest, then assign the isolated cores to the guest via pinning. This should work pretty well if your host is only running KVM.

        You can check isolated cores via:
        root@sysio # cat /sys/devices/system/cpu/isolated
        14-27,42-55
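
That file uses the kernel's cpulist notation (comma-separated CPUs and ranges). To double-check how many CPUs such a list actually sets aside for the guest, a small helper can expand it. This is just a sketch, and cpulist_count is a name made up for it.

```shell
# Count the CPUs named by a kernel cpulist string such as "14-27,42-55".
cpulist_count() {
    total=0
    old_ifs=$IFS
    IFS=,
    for range in $1; do
        case $range in
            *-*) start=${range%-*}; end=${range#*-} ;;  # a range like 14-27
            *)   start=$range; end=$range ;;            # a single CPU like 3
        esac
        total=$((total + end - start + 1))
    done
    IFS=$old_ifs
    echo "$total"
}

# Usage on a real host:
#   cpulist_count "$(cat /sys/devices/system/cpu/isolated)"
```

For the list shown above, cpulist_count "14-27,42-55" prints 28.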

        I’m using Unraid as a host; this is what my boot vars look like:
        ```
        Unraid OS
        kernel /bzimage
        append initrd=/bzroot vfio_iommu_type1.allow_unsafe_interrupts=1 pcie_aspm=off isolcpus=14-27,42-55 video=efifb:off
        ```

        Cheers!

  12. Just wanted to say THANKS SO MUCH!!! This is working way better than without it; not perfect, but close enough. I have an Asus Zenith Extreme Alpha (latest BIOS) and an ASRock Radeon RX 5700 XT DirectX 12 RX 5700 XT CHALLENGER D 8G OC 8GB 256-Bit GDDR6 PCI Express 4.0 video card. I was using Proxmox 6.4 prior to this and upgraded to 7.0 hoping it would fix the issue, but there was no change. Basically, if I shut down my Windows VM with this video card passed through and then started it up again, the video card would lock up and I had to reboot the server to fix things. With this installed, I no longer see the card lock up, but sometimes when I start the VM again the video card won’t work; if I shut the VM down and start it up again, it works.

  13. What is the current status of the Sapphire RX580 (mine’s the 4G model) in a Hackintosh with PCIe passthrough under Proxmox as of Oct 2021?

  14. I have a problem with single GPU passthrough on Proxmox. I installed vendor-reset but nothing changed: after I start the VM, my monitor loses signal. The VM and the server are running, but nothing appears on the monitor. I use an RX 470 and enabled ACS in my BIOS, because without it the whole server crashes when I start the VM. I really don’t know what to do next to get a signal on my monitor.

    1. I got this working by enabling my iGPU and booting from it. Now when I start the VM and switch to the HDMI port of my external GPU I see a signal, but I can’t do anything because for some reason my keyboard is still attached to the server and not to the VM, and my mouse is not attached either. I passed through my USB ports, but my mouse and keyboard still don’t work in the VM. I’m not sure what to do next. If I switch to the HDMI port of my iGPU I see that my keyboard is still attached to the server.

  15. First of all, thanks for these brilliant tutorials. I am using a SAPPHIRE RX 580 and it shows the mentioned DMAR bug on any second init after VM start, even when switching from the UEFI screen to start Recovery.
    Is there any chance to debug vendor-reset and get it to produce some output when it comes into play?
    The module is loaded and amdgpu is blacklisted. So far, so good.
    ID GPU is 1002:67DF
    ID AUDIO is 1002:AAF0
    Kernel is 5.13.10-2-pve

    Until the update to Proxmox 7.1 it kind of worked for me.

    Could you also re-describe the ROM extraction process under Debian Linux?

    Thanx in advance…

  16. you still there nic?
    just trying to setup single GPU passthrough recently and got hit by this bug
    without vendor-reset sometimes the GPU recover after stopping VM, sometimes not

    installed vendor-reset DKMS
    check with lsmod, it says loaded
    but it seems somehow not activated? when starting VM

    there are no logs like this in my dmesg when starting VM

    vfio-pci 0000:03:00.0: AMD_POLARIS10: version 1.0
    vfio-pci 0000:03:00.0: AMD_POLARIS10: performing pre-reset
    vfio-pci 0000:03:00.0: AMD_POLARIS10: performing reset
    vfio-pci 0000:03:00.0: AMD_POLARIS10: GPU pci config reset
    vfio-pci 0000:03:00.0: AMD_POLARIS10: performing post-reset
    vfio-pci 0000:03:00.0: AMD_POLARIS10: reset result = 0

    any idea why?

  17. Hello Nick

    When I run the update command (the very first one) I get errors:

    /usr/sbin/grub-mkconfig: 34: /etc/default/grub: vfio-pci.ids=1002:731f,1002:ab38: not found
    run-parts: /etc/kernel/postinst.d/zz-update-grub exited with return code 127
    Failed to process /etc/kernel/postinst.d at /var/lib/dpkg/info/pve-kernel-5.13.19-6-pve.postinst line 19.
    dpkg: error processing package pve-kernel-5.13.19-6-pve (--configure):
    installed pve-kernel-5.13.19-6-pve package post-installation script subprocess returned error exit status 2
    dpkg: dependency problems prevent configuration of pve-kernel-5.13:
    pve-kernel-5.13 depends on pve-kernel-5.13.19-6-pve; however:
    Package pve-kernel-5.13.19-6-pve is not configured yet.

    dpkg: error processing package pve-kernel-5.13 (--configure):
    dependency problems - leaving unconfigured
    Errors were encountered while processing:
    pve-kernel-5.13.19-6-pve
    pve-kernel-5.13
    E: Sub-process /usr/bin/dpkg returned an error code (1)

    Is it a subscription problem? I have the no-subscription repositories.

        1. Yes, otherwise if you don’t have a subscription you won’t receive any Proxmox security or feature updates, which cripples the system pretty quickly.

  18. So I recreated everything and followed both the GPU passthrough guide and this one but no luck. Now my Monterey VM starts, then hangs, reboots and then slowly boots at the Apple logo.

    It’s not really surprising as I got this error:

    root@theclawpve:~/vendor-reset# dkms install .

    Creating symlink /var/lib/dkms/vendor-reset/0.1.1/source ->
    /usr/src/vendor-reset-0.1.1

    DKMS: add completed.
    Error! Your kernel headers for kernel 5.13.19-2-pve cannot be found.
    Please install the linux-headers-5.13.19-2-pve package,
    or use the --kernelsourcedir option to tell DKMS where it’s located

    And I don’t know what to do. All the other commands went fine.
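
That DKMS error means no headers are installed for the kernel that is currently running (5.13.19-2-pve in this log). This usually happens when the host was not rebooted after a kernel upgrade, or when the headers package for that exact release is missing. Below is a rough check, assuming that header packages provide /lib/modules/<release>/build and that the packages follow the pve-headers-<release> naming used on Proxmox 7. The helper name have_headers is hypothetical.

```shell
# Succeeds if kernel headers appear to be installed for a kernel release.
# $1 = kernel release (default: the running kernel)
# $2 = modules root (default: /lib/modules; only changed for testing)
have_headers() {
    release=${1:-$(uname -r)}
    modules_root=${2:-/lib/modules}
    # -e is false for a dangling "build" symlink, i.e. when headers are missing.
    [ -e "${modules_root}/${release}/build" ]
}

if have_headers; then
    echo "Headers found for $(uname -r); dkms should be able to build."
else
    echo "Headers missing; try: apt install pve-headers-$(uname -r)"
fi
```

If the headers for the running release cannot be installed, rebooting into the newest installed kernel and running "dkms install ." again is the other way to bring the two versions back into line.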

  19. Hey, I tried using this module, but it sadly didn’t really help. Instead of the dmesg output I should be getting, I just got:

    [ 311.117750] vfio-pci 0000:2a:00.0: AMD_VEGA10: version 1.0
    [ 311.117754] vfio-pci 0000:2a:00.0: AMD_VEGA10: performing pre-reset
    [ 311.117866] vfio-pci 0000:2a:00.0: AMD_VEGA10: performing reset
    [ 311.432665] ATOM BIOS: xxx-xxx-xxx
    [ 311.432667] vendor-reset-drm: atomfirmware: bios_scratch_reg_offset initialized to 4c
    [ 311.659454] vfio-pci 0000:2a:00.0: AMD_VEGA10: bus reset disabled? yes
    [ 311.659460] vfio-pci 0000:2a:00.0: AMD_VEGA10: SMU response reg: ffffffff, sol reg: 0, mp1 intr enabled? no, bl ready? no, baco? off
    [ 311.659463] vfio-pci 0000:2a:00.0: AMD_VEGA10: performing post-reset
    [ 311.699783] vfio-pci 0000:2a:00.0: AMD_VEGA10: reset result = 0

    over and over again. Any ideas what could be the issue? Any kind of help would be massively appreciated :-).

  20. After multiple attempts with Ubuntu and Unraid to get GPU passthrough to work, this finally solved my woes!

    Albeit I still have the following issues:

    - The Windows VM has to be shut down; if it is restarted instead, the GPU doesn’t appear to work after the reboot. The VM shows the card, but no output is shown.
    - A clean boot of Proxmox and the Windows VM consistently results in native performance in the Tomb Raider benchmark. After shutting down the VM and booting it again, performance is consistently under half of that until the host is shut down/restarted.

    I have looked into NUMA nodes and CPU pinning, but my Epyc 7252 is a single NUMA node design, so I don’t think the issue is with PCIe lane latency etc.

    Any suggestions?

    1. Update – Apparently just disabling and re-enabling the GPU within Windows solved the performance issues, and restarting the VM is also possible!

      I’ve modified the following PowerShell script (https://superuser.com/questions/1165637/one-gesture-solution-to-disable-enable-a-device-in-device-manager-without-a-thi):

      C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe -command "Get-PnpDevice -FriendlyName 'Radeon*' | Disable-PnpDevice -confirm:$false; Get-PnpDevice -FriendlyName 'Radeon*' | Enable-PnpDevice -confirm:$false"

      Hope it helps someone who has similar issues

  21. Hey Nick,

    Really enjoying the guides. Is there a way to install this reset on Proxmox 6.4 and older versions? I am staying back so I can run Mojave, and the reset bug is starting to drive me a bit mad with my Vega 64.

  22. When Proxmox starts up I get:

    [4.291598] [drm:amdgpu_init [amdgpu]] *ERROR* VGACON disables amdgpu kernel

    I have no idea what the problem is.

    Proxmox loads after showing that, and my 2 VMs with passthrough GPUs (1 NVIDIA and 1 AMD) load like they should.

    But when I test my install media to install Proxmox from scratch, I get the Install Proxmox screen,

    then when I click install I get a blank white screen on the display connected to my AMD GPU.

    I think somehow they are related, but I’m not sure how. Anyone have any ideas, please?

    I’m worried that if I had to start from scratch I wouldn’t be able to reinstall.

    Does this sound like the AMD reset bug?
    It just seems strange that it’s doing it when booting into the install USB.

  23. With my Proxmox PVE 5.15.30-2-pve (Debian) installation and Sapphire Radeon RX 5700 XT, the trick was to change the ‘reset_method’ of the GPU to ‘device_specific’:

    echo 'device_specific' > /sys/bus/pci/devices/0000:XX:XX.X/reset_method

    Use lspci to find your PCI device id (it replaces the 0000:XX:XX.X placeholder above) and set the reset method right after a clean boot into Proxmox.

    When I did this, I could finally see the log lines Nicholas mentioned in dmesg (tip: dmesg -w | grep 'vend\|vfio') and vendor-reset was working.
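
On kernels that expose reset_method (5.15 and later, as far as I know), reading the file lists the reset methods the kernel will try for that device, and writing a method the kernel does not consider supported for the device is rejected. The sketch below wraps both steps. The function name, the example address 0000:03:00.0 and the optional sysfs root argument are assumptions for illustration only.

```shell
# Hypothetical helper: select the device_specific reset for one PCI device.
# $1 = PCI address (e.g. 0000:03:00.0)
# $2 = sysfs root (default: /sys/bus/pci/devices; only changed for testing)
use_device_specific_reset() {
    method_file="${2:-/sys/bus/pci/devices}/$1/reset_method"
    if [ ! -w "$method_file" ]; then
        echo "No writable reset_method for $1" >&2
        return 1
    fi
    # Show what was configured before changing it.
    echo "Current methods: $(cat "$method_file")"
    echo device_specific > "$method_file"
}

# Usage (as root, right after a clean boot):
#   use_device_specific_reset 0000:03:00.0
```

The setting does not survive a reboot, so it has to be applied again after each host boot, before the VM is started.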

    1. Hi there, I tried this but I get an “invalid argument” error. Any idea? I ran the command just after a clean boot into Proxmox with all VMs turned off. Thx 😉

      1. Make sure you didn’t copy and paste the curly quotes that WordPress has inserted around device_specific. I’ve edited that comment now to fix those

        1. Thank you. I still get the same answer: invalid argument. I ran this line:
          echo device_specific > /sys/bus/pci/devices/0000:04:00.0/reset_method
          Best regards
          Sam

          1. It seems that my kernel headers are not the right ones for my kernel version… but I couldn’t figure out how to make it work…

  24. Hello, when I pass through my RX580 2048SP to a Windows PC, the host occasionally freezes when restarting the virtual machine (it is fine most of the time).
    The system logs are as follows:
    Nov 6 18:28:06 pve kernel: [ 9801.522471] DMAR: DRHD: handling fault status reg 40
    Nov 6 18:28:06 pve kernel: [ 9801.523647] DMAR: QI PRIOR: Invalidation Wait qw0 = 0x200000025, qw1 = 0x10005293c
    Nov 6 18:28:06 pve kernel: [ 9801.527784] DMAR: Invalidation Time-out Error (ITE) cleared
    Nov 6 18:28:06 pve kernel: [ 9801.527844] DMAR: VT-d detected Invalidation Time-out Error: SID 0
    Looking forward to your reply!

    1. As far as I know this is unavoidable, vendor-reset didn’t fix this on my RX 580 either.

      Someone mentioned that if they avoid passing through the audio device on their RX 580 it improves this situation.

  25. root@pvemini:~/vendor-reset# dkms remove vendor-reset/0.1.1 –all
    Warning: I do not know how to handle –all.
    Error! There is no instance of vendor-reset 0.1.1
    for kernel 5.15.83-1-pve (x86_64) located in the DKMS tree.
    root@pvemini:~/vendor-reset# dkms status
    vendor-reset, 0.1.1: added
    root@pvemini:~/vendor-reset# HOW TO UNINSTALL

    1. If you didn’t reboot the host before running dkms, the odds are that dkms built the module for the installed kernel, but the running kernel is older.

      If you reboot it should bring those two versions back into line.

    1. RX680M is RDNA2, in theory the reset bug doesn’t exist in that generation.

      vendor-reset has no support for it

      I’ve never seen anybody pass through an AMD APU successfully, are you sure that AMD even supports that?

  26. I successfully booted my Win11 VM on an AMD Cezanne APU (Ryzen 5800H)! I created a vBIOS with UEFI support using GOP Updater.

    Now my problem is the reset bug, of course. I’m able to start the VM only once.
    gnif is not willing to put effort into vendor-reset to make my CPU work.

    Can anyone take over? What exactly has to be researched?

    1. So far the work has been porting the reset routines from the Linux amdgpu opensource driver (contributed by AMD) to the vendor-reset module.

  27. Let me say that your guide is probably the best out there.

    Nevertheless, I seem to be making a mistake somewhere.

    I got the installation of Ventura working smoothly, and it works like a charm via the console. However, when I try to pass through the GPU (ASUS RX 6800 XT 16G) I get stuck at the Apple logo. I know a lot of people out there have a “similar” issue, but not the same one: most of them have the problem even on the console, whereas I can run Ventura over the console and only have it when using the GPU. I select my Ventura disk, the Apple logo pops up, the progress bar appears after about a second, but then it gets stuck there. I get the feeling that the system might be loading in the background and I just can’t see it because the screen is stuck at the logo. But god knows.

    I know the GPU and Proxmox are set up correctly, at least for another VM: I am running W11 in another VM and the GPU works very well there. Of course I do not use them at the same time; I shut down one VM when working on the other to avoid issues.

    Here I am … still waiting for the progress bar to move… 🙁

    Any help is welcome
