2 Sources
[1]
Nvidia RTX 5090 reset bug prompts $1,000 reward for a fix -- cards become completely unresponsive and require a reboot after virtualization reset bug, also impacts RTX PRO 6000
CloudRift and community reports suggest a reset failure on Nvidia's new Blackwell GPUs that bricks the card until the machine is power-cycled.

Nvidia's new RTX 5090 and RTX PRO 6000 GPUs are reportedly being plagued by a reproducible virtualization reset bug that can leave the cards completely unresponsive until the host system is physically rebooted. CloudRift, a GPU cloud provider, published a detailed breakdown of the issue after encountering it on multiple Blackwell-equipped systems in production. The company has even issued a $1,000 public bug bounty for anyone able to identify a fix or root cause.

According to CloudRift's logs, the bug occurs after a GPU has been passed through to a VM using KVM and VFIO. On guest shutdown or GPU reassignment, the host issues a PCIe function-level reset (FLR), which is a standard part of cleaning up a passthrough device. But instead of returning to a known-good state, the GPU fails to respond: "not ready 65535ms after FLR; giving up," the kernel reports. At this point, the card also becomes unreadable to lspci, which throws "unknown header type 7f" errors. CloudRift notes that the only way to restore normal operation is to power-cycle the entire machine.

Tiny Corp, the AI start-up behind tinygrad, brought attention to the issue by reposting CloudRift's findings on X.com with a blunt question: "Do 5090s and RTX PRO 6000s have a hardware defect? We've looked into this and can't find a fix."

Threads across the Proxmox forums and the Level1Techs community suggest that home users and other early adopters of the RTX 5090 are also encountering similar behavior. In one case, a user reported a complete host hang after a Windows guest was shut down, with the GPU failing to reinitialize even after an OS-level reboot. In another, a user wrote: "I found my host became unresponsive. Further debugging shows that the host CPU got soft lock [sic] after a FLO [sic] timeout, which is after a shutdown of LinuxVM. No issue for my previous 4080." Several users confirm that toggling PCIe ASPM or ACS settings does not mitigate the failure. No issues have been reported with older cards such as the RTX 4090, suggesting that the bug may be limited to Nvidia's Blackwell family.

FLR is a critical feature in GPU passthrough configurations, allowing a device to be safely reset and reassigned between guests. If FLR is unreliable, multi-tenant AI workloads and home lab setups using virtualization become risky, particularly when a single card failure takes down the entire host. Nvidia has not yet officially acknowledged the issue, and there is no known mitigation at the time of writing.
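For operators who want to check whether a host has already hit this failure, the two strings quoted above are the signature to look for. The following Python sketch is not from CloudRift or Nvidia; it is a minimal example, assuming it runs on the Linux host with permission to read the kernel log via dmesg and to invoke lspci, that simply scans both outputs for the reported messages.

```python
#!/usr/bin/env python3
"""Minimal sketch: scan a Linux host for the reported Blackwell FLR-failure
signature. Assumes `dmesg` and `lspci` are available and readable."""
import subprocess

# Kernel message quoted by CloudRift when the device never returns after FLR.
FLR_TIMEOUT = "not ready 65535ms after FLR; giving up"
# lspci output for a function whose config space reads back all-ones.
DEAD_HEADER = "unknown header type 7f"

def run(cmd: list[str]) -> str:
    """Run a command and return its stdout, ignoring a non-zero exit code."""
    return subprocess.run(cmd, capture_output=True, text=True).stdout

def main() -> None:
    dmesg = run(["dmesg"])
    lspci = run(["lspci", "-vv"])

    flr_hit = FLR_TIMEOUT in dmesg
    dead_hit = DEAD_HEADER.lower() in lspci.lower()

    if flr_hit or dead_hit:
        print("Possible Blackwell FLR reset failure detected:")
        if flr_hit:
            print("  - kernel reported:", FLR_TIMEOUT)
        if dead_hit:
            print("  - lspci shows a function with unknown header type 7f")
    else:
        print("No FLR failure signature found in dmesg or lspci output.")

if __name__ == "__main__":
    main()
```

If either string turns up, the reports above suggest the only recovery is a full power cycle of the host.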
[2]
NVIDIA's High-End GeForce RTX 5090 & RTX PRO 6000 GPUs Reportedly Bricked by Virtualization Bug, Requiring Full System Reboot to Recover
NVIDIA's flagship GPUs, the GeForce RTX 5090 and the RTX PRO 6000, appear to have encountered a new bug that leaves them unresponsive under virtualization.

NVIDIA's Flagship Blackwell GPUs Are Becoming 'Unresponsive' After Extensive VM Usage

CloudRift, a GPU cloud for developers, was the first to report the issue with NVIDIA's high-end GPUs. According to the company, after a few days of VM usage the cards started to become completely unresponsive and could no longer be accessed unless the node was rebooted. The problem is claimed to be specific to the RTX 5090 and the RTX PRO 6000; models such as the RTX 4090, Hopper H100s, and the Blackwell-based B200s aren't affected for now.

The failure occurs when the GPU is assigned to a VM environment using the VFIO device driver: after a Function Level Reset (FLR), the GPU doesn't respond at all. The unresponsiveness then results in a kernel 'soft lock', which puts the host and guest environments into a deadlock. The only way out is to reboot the host machine, a painful procedure for CloudRift given the volume of guest machines it runs.

The issue isn't limited to CloudRift. A user on the Proxmox forums reported a similar problem, seeing a complete host crash after shutting down a Windows guest. Notably, the user says NVIDIA has responded, claiming that the firm has been able to reproduce the issue and is working on a fix. We are still waiting on official confirmation from NVIDIA, but the problem appears to be specific to Blackwell-based GPUs. In the meantime, CloudRift has put out a $1,000 bug bounty for anyone who can fix or mitigate the issue, and a fix from NVIDIA is expected soon, considering that the bug affects crucial AI workloads.
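As context for the VFIO and FLR mechanics described above, the sketch below is a hypothetical helper rather than anything from either report: it lists the PCI devices a Linux host currently has bound to the vfio-pci driver and, on kernels new enough (roughly 5.15+) to expose the per-device reset_method attribute in sysfs, shows which reset mechanisms the kernel will use for each.

```python
#!/usr/bin/env python3
"""Minimal sketch: list devices bound to vfio-pci and the reset methods the
kernel exposes for them. Assumes sysfs is mounted at /sys."""
from pathlib import Path

VFIO_DRIVER = Path("/sys/bus/pci/drivers/vfio-pci")

def read_attr(dev: Path, name: str) -> str:
    """Read a sysfs attribute, returning 'n/a' if it does not exist."""
    attr = dev / name
    return attr.read_text().strip() if attr.exists() else "n/a"

def main() -> None:
    if not VFIO_DRIVER.exists():
        print("vfio-pci driver not loaded on this host")
        return

    # Devices bound to vfio-pci appear as symlinks named after their PCI
    # address, e.g. 0000:41:00.0; control files like `bind` have no colon.
    for entry in sorted(VFIO_DRIVER.iterdir()):
        if ":" not in entry.name:
            continue
        dev = Path("/sys/bus/pci/devices") / entry.name
        vendor = read_attr(dev, "vendor")         # 0x10de for NVIDIA
        device = read_attr(dev, "device")
        methods = read_attr(dev, "reset_method")  # e.g. "flr" or "flr bus"
        print(f"{entry.name}  vendor={vendor} device={device} "
              f"reset_method={methods}")

if __name__ == "__main__":
    main()
```

Devices that advertise FLR are normally reset that way when a guest shuts down, which is exactly the step that reportedly fails on the RTX 5090 and RTX PRO 6000.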
NVIDIA's latest high-end GPUs, the RTX 5090 and RTX PRO 6000, are experiencing a critical virtualization bug that causes unresponsiveness and requires system reboots, impacting AI workloads and cloud services.
NVIDIA's latest high-end graphics cards, the GeForce RTX 5090 and RTX PRO 6000, are reportedly facing a significant virtualization bug that renders the GPUs unresponsive and requires a full system reboot to recover [1]. This issue, which appears to be specific to NVIDIA's Blackwell family of GPUs, is causing concern among cloud providers, AI researchers, and home users alike. (Source: Tom's Hardware)
The problem occurs when the GPU is assigned to a virtual machine (VM) environment using the VFIO device driver. After a Function Level Reset (FLR), which is a standard part of cleaning up a passthrough device, the GPU fails to respond [2]. CloudRift, a GPU cloud provider, reports that the kernel logs show "not ready 65535ms after FLR; giving up," indicating a failure in the reset process [1]. The unresponsiveness results in a kernel 'soft lock', putting both the host and guest environments in a deadlock. The only way to restore normal operation is to power-cycle the entire machine, which is particularly problematic for cloud providers managing numerous guest machines [1].
This issue significantly impacts multi-tenant AI workloads and home lab setups using virtualization, as a single card failure can take down the entire host system.

The bug has been reported by various sources, including CloudRift, Proxmox forum users, and the Level1Techs community. Tiny Corp, an AI start-up, has also brought attention to the issue, questioning whether the RTX 5090 and RTX PRO 6000 have a hardware defect [1]. Importantly, older models such as the RTX 4090, Hopper H100s, and even the Blackwell-based B200s are not affected by this bug [2].
While NVIDIA has not yet officially acknowledged the issue publicly, a user on the Proxmox forum reported that the company has been able to reproduce the problem and is working on a fix [2]. In an effort to expedite a solution, CloudRift has issued a $1,000 public bug bounty for anyone able to identify a fix or root cause [1].

This virtualization bug poses significant challenges for the AI and cloud computing industries, which rely heavily on high-performance GPUs for their operations. The inability to safely reset and reassign GPUs between guests in multi-tenant environments could lead to increased downtime and reduced efficiency in data centers and cloud services.
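To make that reset-and-reassign step concrete, here is a hedged sketch, not an official procedure from any of the sources, of how a host operator might trigger a function reset through the standard sysfs reset node and then check whether the device still answers on its config space. The PCI address is a placeholder, and based on the reports above, an affected Blackwell card would be expected to fail this check (and possibly wedge the host) until the machine is power-cycled.

```python
#!/usr/bin/env python3
"""Minimal sketch: trigger a PCI function reset via sysfs and verify the
device still responds. Assumes root on a Linux host; the default PCI
address below is a placeholder, not a real device."""
import sys
from pathlib import Path

def reset_and_check(bdf: str) -> bool:
    dev = Path("/sys/bus/pci/devices") / bdf

    # Writing "1" to the `reset` node asks the kernel to reset the function,
    # preferring FLR when the device advertises it; the write blocks until
    # the reset completes or the kernel gives up.
    try:
        (dev / "reset").write_text("1")
    except OSError as exc:
        print(f"{bdf}: reset request returned an error: {exc}")

    # Re-read the vendor ID straight from config space. A healthy NVIDIA
    # device reports 0x10de; a dead one reads back 0xffff because config
    # space returns all-ones.
    vendor = int.from_bytes((dev / "config").read_bytes()[:2], "little")
    print(f"{bdf}: vendor ID read back from config space = {vendor:#06x}")
    return vendor != 0xFFFF

if __name__ == "__main__":
    bdf = sys.argv[1] if len(sys.argv) > 1 else "0000:01:00.0"  # placeholder
    ok = reset_and_check(bdf)
    print("device still responsive" if ok else
          "device unresponsive; a host power cycle is reportedly required")
```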
As the AI industry continues to grow and demand for high-performance computing resources increases, resolving this issue promptly is crucial for NVIDIA to maintain its position as a leader in the GPU market. The coming days will likely see increased pressure on NVIDIA to provide a comprehensive fix and ensure the reliability of its flagship products in virtualized environments.
Summarized by Navi