2 Sources
[1]
Nvidia RTX 5090 reset bug prompts $1,000 reward for a fix -- cards become completely unresponsive and require a reboot after virtualization reset bug, also impacts RTX PRO 6000
CloudRift and community reports suggest a reset failure on Nvidia's new Blackwell GPUs that bricks the card until the machine is power-cycled.

Nvidia's new RTX 5090 and RTX PRO 6000 GPUs are reportedly being plagued by a reproducible virtualization reset bug that can leave the cards completely unresponsive until the host system is physically rebooted. CloudRift, a GPU cloud provider, published a detailed breakdown of the issue after encountering it on multiple Blackwell-equipped systems in production. The company has even issued a $1,000 public bug bounty for anyone able to identify a fix or root cause.

According to CloudRift's logs, the bug occurs after a GPU has been passed through to a VM using KVM and VFIO. On guest shutdown or GPU reassignment, the host issues a PCIe function-level reset (FLR), which is a standard part of cleaning up a passthrough device. But instead of returning to a known-good state, the GPU fails to respond: "not ready 65535ms after FLR; giving up," the kernel reports. At this point, the card also becomes unreadable to lspci, which throws "unknown header type 7f" errors. CloudRift notes that the only way to restore normal operation is to power-cycle the entire machine.

Tiny Corp, the AI start-up behind tinygrad, brought attention to the issue by reposting CloudRift's findings on X.com with a blunt question: "Do 5090s and RTX PRO 6000s have a hardware defect? We've looked into this and can't find a fix."

Threads across the Proxmox forums and the Level1Techs community suggest that home users and other early adopters of the RTX 5090 are also encountering similar behavior. In one case, a user reported a complete host hang after a Windows guest was shut down, with the GPU failing to reinitialize even after an OS-level reboot. In another, a user wrote: "I found my host became unresponsive. Further debugging shows that the host CPU got soft lock [sic] after a FLO [sic] timeout, which is after a shutdown of LinuxVM. No issue for my previous 4080." Several users confirm that toggling PCIe ASPM or ACS settings does not mitigate the failure. No issues have been reported with older cards such as the RTX 4090, suggesting that the bug may be limited to Nvidia's Blackwell family.

FLR is a critical feature in GPU passthrough configurations, allowing a device to be safely reset and reassigned between guests. If FLR is unreliable, multi-tenant AI workloads and home lab setups using virtualization become risky, particularly when a single card failure takes down the entire host. Nvidia has not yet officially acknowledged the issue, and there is no known mitigation at the time of writing.
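For operators who want to check whether a host has already hit this failure, the two strings quoted above are the signature to look for. The following Python sketch is not from CloudRift or Nvidia; it is a minimal example, assuming it runs on the Linux host with permission to read the kernel log via dmesg and to invoke lspci, that simply scans both outputs for the reported messages.

```python
#!/usr/bin/env python3
"""Minimal sketch: scan a Linux host for the reported Blackwell FLR-failure
signature. Assumes `dmesg` and `lspci` are available and readable."""
import subprocess

# Kernel message quoted by CloudRift when the device never returns after FLR.
FLR_TIMEOUT = "not ready 65535ms after FLR; giving up"
# lspci output for a function whose config space reads back all-ones.
DEAD_HEADER = "unknown header type 7f"

def run(cmd: list[str]) -> str:
    """Run a command and return its stdout, ignoring a non-zero exit code."""
    return subprocess.run(cmd, capture_output=True, text=True).stdout

def main() -> None:
    dmesg = run(["dmesg"])
    lspci = run(["lspci", "-vv"])

    flr_hit = FLR_TIMEOUT in dmesg
    dead_hit = DEAD_HEADER.lower() in lspci.lower()

    if flr_hit or dead_hit:
        print("Possible Blackwell FLR reset failure detected:")
        if flr_hit:
            print("  - kernel reported:", FLR_TIMEOUT)
        if dead_hit:
            print("  - lspci shows a function with unknown header type 7f")
    else:
        print("No FLR failure signature found in dmesg or lspci output.")

if __name__ == "__main__":
    main()
```

If either string turns up, the reports above suggest the only recovery is a full power cycle of the host.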
[2]
NVIDIA's High-End GeForce RTX 5090 & RTX PRO 6000 GPUs Reportedly Bricked by Virtualization Bug, Requiring Full System Reboot to Recover
NVIDIA's flagship GPUs, the GeForce RTX 5090 and the RTX PRO 6000, appear to have encountered a new bug that leaves them unresponsive under virtualization.

NVIDIA's Flagship Blackwell GPUs Are Becoming 'Unresponsive' After Extensive VM Usage

CloudRift, a GPU cloud for developers, was the first to report the issue with NVIDIA's high-end GPUs. According to the company, after a few days of VM usage the cards started to become completely unresponsive and could no longer be accessed unless the node was rebooted. The problem is claimed to be specific to the RTX 5090 and the RTX PRO 6000; models such as the RTX 4090, Hopper H100s, and the Blackwell-based B200s aren't affected for now.

The failure occurs when the GPU is assigned to a VM environment using the VFIO device driver: after a Function Level Reset (FLR), the GPU doesn't respond at all. The unresponsiveness then results in a kernel 'soft lock', which puts the host and guest environments into a deadlock. The only way out is to reboot the host machine, a painful procedure for CloudRift given the volume of guest machines it runs.

The issue isn't limited to CloudRift. A user on the Proxmox forums reported a similar problem, seeing a complete host crash after shutting down a Windows guest. Notably, the user says NVIDIA has responded, claiming that the firm has been able to reproduce the issue and is working on a fix. We are still waiting on official confirmation from NVIDIA, but the problem appears to be specific to Blackwell-based GPUs. In the meantime, CloudRift has put out a $1,000 bug bounty for anyone who can fix or mitigate the issue, and a fix from NVIDIA is expected soon, considering that the bug affects crucial AI workloads.
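As context for the VFIO and FLR mechanics described above, the sketch below is a hypothetical helper rather than anything from either report: it lists the PCI devices a Linux host currently has bound to the vfio-pci driver and, on kernels new enough (roughly 5.15+) to expose the per-device reset_method attribute in sysfs, shows which reset mechanisms the kernel will use for each.

```python
#!/usr/bin/env python3
"""Minimal sketch: list devices bound to vfio-pci and the reset methods the
kernel exposes for them. Assumes sysfs is mounted at /sys."""
from pathlib import Path

VFIO_DRIVER = Path("/sys/bus/pci/drivers/vfio-pci")

def read_attr(dev: Path, name: str) -> str:
    """Read a sysfs attribute, returning 'n/a' if it does not exist."""
    attr = dev / name
    return attr.read_text().strip() if attr.exists() else "n/a"

def main() -> None:
    if not VFIO_DRIVER.exists():
        print("vfio-pci driver not loaded on this host")
        return

    # Devices bound to vfio-pci appear as symlinks named after their PCI
    # address, e.g. 0000:41:00.0; control files like `bind` have no colon.
    for entry in sorted(VFIO_DRIVER.iterdir()):
        if ":" not in entry.name:
            continue
        dev = Path("/sys/bus/pci/devices") / entry.name
        vendor = read_attr(dev, "vendor")         # 0x10de for NVIDIA
        device = read_attr(dev, "device")
        methods = read_attr(dev, "reset_method")  # e.g. "flr" or "flr bus"
        print(f"{entry.name}  vendor={vendor} device={device} "
              f"reset_method={methods}")

if __name__ == "__main__":
    main()
```

Devices that advertise FLR are normally reset that way when a guest shuts down, which is exactly the step that reportedly fails on the RTX 5090 and RTX PRO 6000.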
NVIDIA's latest high-end GPUs, the RTX 5090 and RTX PRO 6000, are experiencing a critical virtualization bug that causes unresponsiveness and requires system reboots, impacting AI workloads and cloud services.
NVIDIA's latest high-end graphics cards, the GeForce RTX 5090 and RTX PRO 6000, are reportedly facing a significant virtualization bug that renders the GPUs unresponsive and requires a full system reboot to recover [1]. This issue, which appears to be specific to NVIDIA's Blackwell family of GPUs, is causing concern among cloud providers, AI researchers, and home users alike. (Source: Tom's Hardware)
The problem occurs when the GPU is assigned to a virtual machine (VM) environment using the VFIO device driver. After a Function Level Reset (FLR), which is a standard part of cleaning up a passthrough device, the GPU fails to respond [2]. CloudRift, a GPU cloud provider, reports that the kernel logs show "not ready 65535ms after FLR; giving up," indicating a failure in the reset process [1]. The unresponsiveness results in a kernel 'soft lock', putting both the host and guest environments in a deadlock. The only way to restore normal operation is to power-cycle the entire machine, which is particularly problematic for cloud providers managing numerous guest machines [1].
This issue significantly impacts multi-tenant AI workloads and home lab setups using virtualization, as a single card failure can take down the entire host system.

The bug has been reported by various sources, including CloudRift, Proxmox forum users, and the Level1Techs community. Tiny Corp, an AI start-up, has also brought attention to the issue, questioning whether the RTX 5090 and RTX PRO 6000 have a hardware defect [1]. Importantly, older models such as the RTX 4090, Hopper H100s, and even the Blackwell-based B200s are not affected by this bug [2].
While NVIDIA has not yet officially acknowledged the issue publicly, a user on the Proxmox forum reported that the company has been able to reproduce the problem and is working on a fix [2]. In an effort to expedite a solution, CloudRift has issued a $1,000 public bug bounty for anyone able to identify a fix or root cause [1].

This virtualization bug poses significant challenges for the AI and cloud computing industries, which rely heavily on high-performance GPUs for their operations. The inability to safely reset and reassign GPUs between guests in multi-tenant environments could lead to increased downtime and reduced efficiency in data centers and cloud services.
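To make that reset-and-reassign step concrete, here is a hedged sketch, not an official procedure from any of the sources, of how a host operator might trigger a function reset through the standard sysfs reset node and then check whether the device still answers on its config space. The PCI address is a placeholder, and based on the reports above, an affected Blackwell card would be expected to fail this check (and possibly wedge the host) until the machine is power-cycled.

```python
#!/usr/bin/env python3
"""Minimal sketch: trigger a PCI function reset via sysfs and verify the
device still responds. Assumes root on a Linux host; the default PCI
address below is a placeholder, not a real device."""
import sys
from pathlib import Path

def reset_and_check(bdf: str) -> bool:
    dev = Path("/sys/bus/pci/devices") / bdf

    # Writing "1" to the `reset` node asks the kernel to reset the function,
    # preferring FLR when the device advertises it; the write blocks until
    # the reset completes or the kernel gives up.
    try:
        (dev / "reset").write_text("1")
    except OSError as exc:
        print(f"{bdf}: reset request returned an error: {exc}")

    # Re-read the vendor ID straight from config space. A healthy NVIDIA
    # device reports 0x10de; a dead one reads back 0xffff because config
    # space returns all-ones.
    vendor = int.from_bytes((dev / "config").read_bytes()[:2], "little")
    print(f"{bdf}: vendor ID read back from config space = {vendor:#06x}")
    return vendor != 0xFFFF

if __name__ == "__main__":
    bdf = sys.argv[1] if len(sys.argv) > 1 else "0000:01:00.0"  # placeholder
    ok = reset_and_check(bdf)
    print("device still responsive" if ok else
          "device unresponsive; a host power cycle is reportedly required")
```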
As the AI industry continues to grow and demand for high-performance computing resources increases, resolving this issue promptly is crucial for NVIDIA to maintain its position as a leader in the GPU market. The coming days will likely see increased pressure on NVIDIA to provide a comprehensive fix and ensure the reliability of its flagship products in virtualized environments.
Summarized by Navi