3 Sources
[1]
Nvidia RTX 5090 reset bug prompts $1,000 reward for a fix -- cards become completely unresponsive and require a reboot after a virtualization reset bug that also impacts the RTX PRO 6000
CloudRift and community reports suggest a reset failure on Nvidia's new Blackwell GPUs that bricks the card until the machine is power-cycled.

Nvidia's new RTX 5090 and RTX PRO 6000 GPUs are reportedly being plagued by a reproducible virtualization reset bug that can leave the cards completely unresponsive until the host system is physically rebooted. CloudRift, a GPU cloud provider, published a detailed breakdown of the issue after encountering it on multiple Blackwell-equipped systems in production. The company has even issued a $1,000 public bug bounty for anyone able to identify a fix or root cause.

According to CloudRift's logs, the bug occurs after a GPU has been passed through to a VM using KVM and VFIO. On guest shutdown or GPU reassignment, the host issues a PCIe function-level reset (FLR), a standard part of cleaning up a passthrough device. But instead of returning to a known-good state, the GPU fails to respond: "not ready 65535ms after FLR; giving up," the kernel reports. At this point, the card also becomes unreadable to lspci, which throws "unknown header type 7f" errors. CloudRift notes that the only way to restore normal operation is to power-cycle the entire machine.

Tiny Corp, the AI start-up behind tinygrad, brought attention to the issue by reposting CloudRift's findings on X.com with a blunt question: "Do 5090s and RTX PRO 6000s have a hardware defect? We've looked into this and can't find a fix."

Threads across the Proxmox forums and the Level1Techs community suggest that home users and other early adopters of the RTX 5090 are also encountering similar behavior. In one case, a user reported a complete host hang after a Windows guest was shut down, with the GPU failing to reinitialize even after an OS-level reboot. In another, a user said, "I found my host became unresponsive. Further debugging shows that the host CPU got soft lock [sic] after a FLO [sic] timeout, which is after a shutdown of LinuxVM. No issue for my previous 4080." Several users confirm that toggling PCIe ASPM or ACS settings does not mitigate the failure. No issues have been reported with older cards such as the RTX 4090, suggesting that the bug may be limited to Nvidia's Blackwell family.

FLR is a critical feature in GPU passthrough configurations, allowing a device to be safely reset and reassigned between guests. If FLR is unreliable, multi-tenant AI workloads and home-lab setups using virtualization become risky, particularly when a single card failure takes down the entire host. Nvidia has not yet officially acknowledged the issue, and there is no known mitigation at the time of writing.
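The "unknown header type 7f" symptom is what lspci prints when a device's PCI configuration space reads back as all ones. Below is a minimal diagnostic sketch, assuming a Linux host with the standard sysfs PCI layout and a hypothetical bus address for the affected GPU, that checks for that dead-device signature directly.

```python
# Minimal diagnostic sketch, assuming a Linux host with the standard sysfs
# PCI layout. The bus address below is a hypothetical placeholder for the
# passed-through GPU. A card stuck after a failed FLR typically returns
# all-0xFF bytes from config space, which lspci reports as
# "unknown header type 7f".
from pathlib import Path

GPU_BDF = "0000:81:00.0"  # hypothetical domain:bus:device.function


def config_space_is_dead(bdf: str) -> bool:
    """Return True if the device's config space reads back as all ones."""
    cfg = Path(f"/sys/bus/pci/devices/{bdf}/config").read_bytes()
    # Offset 0x0E holds the header type; a dead device reads 0xFF there,
    # and lspci masks off the multifunction bit to print "7f".
    header_type = cfg[0x0E] & 0x7F
    return header_type == 0x7F and all(b == 0xFF for b in cfg[:64])


if __name__ == "__main__":
    if config_space_is_dead(GPU_BDF):
        print(f"{GPU_BDF}: config space is all ones; the FLR likely failed")
    else:
        print(f"{GPU_BDF}: device is still answering PCIe config reads")
```

On a healthy card the first bytes of config space hold the vendor and device IDs, so the all-ones check only trips once the card has stopped answering config cycles.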
[2]
RTX 5090 and RTX PRO 6000 GPUs have a new bug: a full system reboot is needed after virtualization
TL;DR: NVIDIA's GeForce RTX 5090 and RTX PRO 6000 GPUs face a critical virtualization bug causing system crashes and unresponsiveness after days of VM use, requiring full reboots. NVIDIA acknowledges the issue, which affects AI workloads, and is actively developing a fix, while CloudRift is offering a $1,000 bug bounty for solutions.

NVIDIA's higher-end GeForce RTX 5090 and RTX PRO 6000 cards have hit a new bug after running virtualization for a few days, and it requires a full system reboot to get them back online.

CloudRift, a GPU cloud for developers, is reporting crashing issues with both the RTX 5090 and RTX PRO 6000, saying that after a "few days" of VM usage the cards became completely unresponsive. The GPUs can no longer be accessed unless the node is rebooted. Thankfully, the problem appears limited to the RTX 5090 and RTX PRO 6000; the RTX 4090, Hopper H100, and Blackwell B200 aren't affected, for now.

What's happening exactly? The GPU gets assigned to a VM environment using the VFIO device driver, and after the function-level reset (FLR) the GPU is completely unresponsive. The unresponsive GPU then causes a kernel "soft lock" that puts the host and client environments into a deadlock. The only way out of that deadlock is to reboot the machine, which isn't easy for CloudRift, given its large volume of guest machines.

CloudRift isn't the only company affected, either. A user on the Proxmox forums reported something similar, witnessing a complete host crash after shutting down a Windows client. He said that NVIDIA has responded to the problem, claiming that the company has been able to reproduce the issue and is working on a fix. Better yet, CloudRift has a $1,000 bug bounty for anyone who can fix or mitigate the issue, but with the bug hurting AI workloads, NVIDIA's own fix shouldn't be too far away.
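For context, the reset that hangs is the same one the host kernel exposes through sysfs. The sketch below, assuming a Linux host, root privileges, and a hypothetical PCI address for a GPU bound to vfio-pci, shows that cleanup step in isolation; it reproduces the reset request the passthrough teardown performs, and is not a workaround.

```python
# Sketch of the host-side cleanup step only; this is not a fix. Assumes a
# Linux host, root privileges, and a hypothetical PCI address for a GPU
# bound to vfio-pci. Writing "1" to the sysfs "reset" attribute asks the
# kernel to reset the function (an FLR when the device advertises one);
# on affected Blackwell cards this is where the kernel logs
# "not ready ... after FLR; giving up".
from pathlib import Path

GPU_BDF = "0000:81:00.0"  # hypothetical address of the passed-through GPU


def request_function_reset(bdf: str) -> None:
    """Ask the kernel to reset the PCI function, as done after guest shutdown."""
    reset_node = Path(f"/sys/bus/pci/devices/{bdf}/reset")
    if not reset_node.exists():
        raise RuntimeError(f"{bdf} exposes no reset method to the kernel")
    reset_node.write_text("1")  # blocks until the reset completes or times out


if __name__ == "__main__":
    request_function_reset(GPU_BDF)
    print(f"reset requested for {GPU_BDF}; check the kernel log for FLR timeouts")
```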
[3]
NVIDIA's High-End GeForce RTX 5090 & RTX PRO 6000 GPUs Reportedly Bricked by Virtualization Bug, Requiring Full System Reboot to Recover
It seems like NVIDIA's flagship GPUs, the GeForce RTX 5090 and the RTX PRO 6000, have encountered a new bug that involves unresponsiveness under virtualization.

NVIDIA's Flagship Blackwell GPUs Are Becoming 'Unresponsive' After Extensive VM Usage

CloudRift, a GPU cloud for developers, was the first to report crashing issues with NVIDIA's high-end GPUs. According to the company, after a 'few days' of VM usage the SKUs started to become completely unresponsive, and the GPUs can no longer be accessed unless the node system is rebooted. The problem is claimed to be specific to the RTX 5090 and the RTX PRO 6000; models such as the RTX 4090, Hopper H100s, and the Blackwell-based B200s aren't affected for now.

The problem specifically occurs when the GPU is assigned to a VM environment using the VFIO device driver: after the function-level reset (FLR), the GPU doesn't respond at all. The unresponsiveness then results in a kernel 'soft lock', which puts the host and client environments under a deadlock. To get out of it, the host machine has to be rebooted, which is a difficult procedure for CloudRift, considering the volume of its guest machines.

This issue isn't limited to CloudRift. A user on the Proxmox forums has reported a similar problem, seeing a complete host crash after shutting down a Windows client. Interestingly, he says that NVIDIA has responded to the problem, claiming that the firm has been able to reproduce the issue and is working on a fix. We are waiting on official confirmation from NVIDIA, but it seems like the problem is specific to Blackwell-based GPUs. CloudRift has put out a $1,000 bug bounty for those who can fix or mitigate the issue, and we are expecting NVIDIA to release a fix soon, considering that it is affecting crucial AI workloads.
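Since the failure reportedly only shows up after days of VM churn, operators may want to watch the kernel log for the two signatures these reports mention. A small monitoring sketch, assuming a Linux host where the current user is allowed to run dmesg, could look like this:

```python
# Monitoring sketch: scan the kernel log for the two failure signatures
# described in these reports. Assumes a Linux host where the current user
# is allowed to run dmesg.
import subprocess

SYMPTOMS = (
    "not ready 65535ms after FLR; giving up",
    "soft lockup",
)


def scan_kernel_log() -> list[str]:
    """Return kernel log lines matching the known failure signatures."""
    log = subprocess.run(["dmesg"], capture_output=True, text=True, check=True).stdout
    return [line for line in log.splitlines() if any(s in line for s in SYMPTOMS)]


if __name__ == "__main__":
    hits = scan_kernel_log()
    if hits:
        print("possible Blackwell FLR failure detected:")
        print("\n".join(hits))
    else:
        print("no FLR timeout or soft-lockup messages found")
```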
Nvidia's RTX 5090 and RTX PRO 6000 GPUs are experiencing a severe virtualization reset bug that renders the cards unresponsive, requiring system reboots. The issue affects AI workloads and virtualization setups, with Nvidia working on a fix.
Nvidia's flagship GPUs, the RTX 5090 and RTX PRO 6000, are reportedly plagued by a critical virtualization reset bug that renders the cards unresponsive and requires a full system reboot to recover [1][2]. This issue is particularly concerning for AI workloads and virtualization setups, prompting a $1,000 bug bounty for anyone who can identify a fix or root cause.

Source: Tom's Hardware

The problem occurs after a GPU has been passed through to a virtual machine (VM) using KVM and VFIO. When the guest is shut down or the GPU is reassigned, the host issues a PCIe function-level reset (FLR), which is a standard cleanup procedure for passthrough devices [1]. However, instead of returning to a known-good state, the GPU fails to respond, becoming unreadable to system tools and requiring a complete power cycle of the machine to restore normal operation.

CloudRift, a GPU cloud provider, has reported encountering this issue on multiple Blackwell-equipped systems in production [1][2]. The bug is particularly problematic for multi-tenant AI workloads and home lab setups using virtualization, as a single card failure can take down the entire host system. Users across various forums, including Proxmox and Level1Techs, have reported similar behaviors, with some experiencing complete host hangs after shutting down Windows guests [1][3].

Source: Wccftech
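To make the passthrough lifecycle described above concrete, the sketch below drives it through libvirt's virsh tool (assumed installed, with a hypothetical device address): detaching hands the GPU to vfio-pci for guest use, and reattaching returns it to the host, which is the cleanup path in which the kernel resets the function on affected cards.

```python
# Sketch of the passthrough reassignment flow via libvirt's virsh CLI
# (assumed installed). The node-device name below is hypothetical.
# nodedev-detach hands the GPU to vfio-pci for guest use; nodedev-reattach
# returns it to the host, triggering the cleanup in which the kernel
# resets the function.
import subprocess

GPU_NODEDEV = "pci_0000_81_00_0"  # hypothetical libvirt node-device name


def run_virsh(*args: str) -> None:
    """Run a virsh command and echo its output for debugging."""
    result = subprocess.run(["virsh", *args], capture_output=True, text=True)
    print(result.stdout.strip() or result.stderr.strip())
    result.check_returncode()


if __name__ == "__main__":
    run_virsh("nodedev-detach", GPU_NODEDEV)    # prepare the GPU for passthrough
    run_virsh("nodedev-reattach", GPU_NODEDEV)  # give it back to the host
```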
Nvidia has reportedly acknowledged the issue and is actively working on developing a fix [2][3]. In the meantime, CloudRift has issued a $1,000 public bug bounty for anyone able to identify a fix or root cause of the problem [1][2]. The community is actively engaged in finding solutions, with users experimenting with various BIOS settings and configurations to mitigate the issue.

This bug highlights the challenges faced by cutting-edge hardware in complex virtualization environments. As GPUs become increasingly crucial for AI and machine learning workloads, such issues can have significant implications for businesses relying on these technologies. The incident also underscores the importance of thorough testing and rapid response to hardware-level bugs in the fast-paced world of GPU development and deployment.