NVIDIA's Flagship GPUs Face Virtualization Bug: RTX 5090 and PRO 6000 Affected

Reviewed by Nidhi Govil

NVIDIA's latest high-end GPUs, the RTX 5090 and RTX PRO 6000, are experiencing a critical virtualization bug that causes unresponsiveness and requires system reboots, impacting AI workloads and cloud services.

The Virtualization Bug: A Critical Issue for NVIDIA's Flagship GPUs

NVIDIA's latest high-end graphics cards, the GeForce RTX 5090 and RTX PRO 6000, are reportedly facing a significant virtualization bug that renders the GPUs unresponsive and requires a full system reboot to recover [1]. This issue, which appears to be specific to NVIDIA's Blackwell family of GPUs, is causing concern among cloud providers, AI researchers, and home users alike.

Source: Tom's Hardware

The Nature of the Bug

The problem occurs when the GPU is passed through to a virtual machine (VM) using the VFIO device driver. After a Function Level Reset (FLR), a standard step in cleaning up a passthrough device before it is reassigned, the GPU stops responding [2]. CloudRift, a GPU cloud provider, reports that the kernel logs show "not ready 65535ms after FLR; giving up," indicating that the reset never completed [1].
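As a rough illustration of how an operator might spot this failure, the following Python sketch scans kernel log lines for the message fragment CloudRift quoted. Only the fragment "not ready 65535ms after FLR; giving up" comes from the report; the function name `find_flr_timeouts`, the sample lines, and their "example:" prefix are hypothetical, and real kernel log lines will carry device and driver prefixes that this sketch does not assume.

```python
import re

# Matches the message fragment quoted in the report. The wait time is
# captured as a number so other timeout values would also be caught.
FLR_TIMEOUT = re.compile(r"not ready (\d+)ms after FLR; giving up")


def find_flr_timeouts(log_lines):
    """Return (line_number, wait_ms) for each line reporting a failed FLR."""
    hits = []
    for number, line in enumerate(log_lines, start=1):
        match = FLR_TIMEOUT.search(line)
        if match:
            hits.append((number, int(match.group(1))))
    return hits


# Hypothetical sample log: only the quoted fragment is taken from the report.
sample_log = [
    "example: device assigned to guest",
    "example: not ready 65535ms after FLR; giving up",
    "example: unrelated message",
]

print(find_flr_timeouts(sample_log))  # [(2, 65535)]
```

In practice the lines would come from the host's kernel log rather than a hard-coded list. A match only signals that a reset timed out; it does not identify the root cause.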

Impact on Systems and Workloads

The unresponsive GPU triggers a kernel soft lockup that deadlocks both the host and guest environments. The only way to restore normal operation is to power-cycle the entire machine, which is particularly problematic for cloud providers managing numerous guest machines [1]. The issue significantly impacts multi-tenant AI workloads and home lab setups that use virtualization, because a single card failure can take down the entire host system.

Widespread Reports and Confirmations

The bug has been reported by several sources, including CloudRift, Proxmox forum users, and the Level1Techs community. Tiny Corp, an AI start-up, has also drawn attention to the issue, questioning whether the RTX 5090 and RTX PRO 6000 have a hardware defect [1]. Importantly, other GPUs such as the RTX 4090, the Hopper-based H100, and even the Blackwell-based B200 are not affected by this bug [2].

NVIDIA's Response and Community Action

While NVIDIA has not yet publicly acknowledged the issue, a user on the Proxmox forum reported that the company has been able to reproduce the problem and is working on a fix [2]. In an effort to expedite a solution, CloudRift has issued a $1,000 public bug bounty for anyone able to identify a fix or root cause [1].

Implications for the Industry

This virtualization bug poses significant challenges for the AI and cloud computing industries, which rely heavily on high-performance GPUs for their operations. The inability to safely reset and reassign GPUs between guests in multi-tenant environments could lead to increased downtime and reduced efficiency in data centers and cloud services.

As the AI industry continues to grow and demand for high-performance computing resources increases, resolving this issue promptly is crucial for NVIDIA to maintain its position as a leader in the GPU market. The coming days will likely see increased pressure on NVIDIA to provide a comprehensive fix and ensure the reliability of its flagship products in virtualized environments.
