Datadog launches GPU Monitoring as AI costs soar and GPU instances hit 14% of cloud compute


Datadog introduces GPU Monitoring to help organizations manage escalating AI costs as GPU instances now account for 14% of cloud compute spending. The observability platform provides unified visibility across AI infrastructure, linking GPU fleet health, cost, and performance to help teams identify inefficiencies, troubleshoot workloads faster, and reduce wasted spending on expensive silicon.

Datadog Addresses Escalating AI Costs with New GPU Monitoring Tool

Datadog has launched GPU Monitoring for its observability platform, targeting one of the most pressing challenges facing AI-driven organizations: managing rapidly escalating AI costs [1]. The timing is critical, as GPU instances now represent 14 percent of cloud compute costs, a figure expected to grow substantially as companies expand their AI investments [2]. According to IDC, worldwide spending on AI infrastructure reached $89.9 billion in Q4 2025, up 62 percent year-over-year, with accelerated compute—primarily GPUs—forming the structural backbone of this growth [1].

Unified Visibility Across the AI Stack Enables Cost Optimization

The challenge for most organizations isn't just rising costs but the inability to understand where money is being wasted. "While these companies can see their costs climbing, they can't chargeback GPU spend across business units, see workload context or identify clear next steps for improvement," explained Yanbing Li, Chief Product Officer at Datadog [2]. GPU Monitoring addresses this gap by providing unified visibility that links GPU fleet health, cost, and performance directly to the teams using them [1]. The tool works across cloud, neocloud instances, and on-premises GPU fleets, making it valuable for organizations with sovereignty concerns about AI in the cloud [1].

Source: CXOToday


Identifying AI Workloads That Waste Resources and Money

The platform excels at spotting GPU efficiency problems that silently drain budgets. Teams can identify portions of their fleet sitting completely idle or being consumed by AI workloads that don't require GPUs at all [1]. It detects zombie processes soaking up GPU time, as well as workloads that were never configured to use GPUs in the first place and are effectively burning cash [1]. Datadog's own internal use demonstrates the impact: GPU Monitoring helped the company save tens of thousands of dollars in monthly expenses by identifying and removing a serving pod stuck in its initialization phase [1].

Faster Troubleshooting and Performance Bottleneck Resolution

Most existing GPU tools provide only high-level device health metrics without surfacing resource contention issues or explaining why training and inference workloads fail [2]. This lack of visibility forces teams to overprovision as the safest default, leading to wasted spending [2]. GPU Monitoring streamlines troubleshooting by correlating stalled workloads directly to the underlying GPUs, pods, and processes, allowing engineers to resolve performance bottlenecks in minutes instead of hours [2].

AI Infrastructure Efficiency Becomes a Competitive Necessity

"Smartly managing AI spend becomes a board-level conversation when capacity is misallocated, training and inference workloads stall, and costs escalate," Li noted [2]. The tool enables platform engineering and machine learning teams to share a single view for investigation, helping them maximize return on investment with predictable spending patterns [2]. Kai Huang, Head of Product at Hyperbolic, highlighted the practical benefits: "We get per-instance, per-device visibility into core utilization, memory, power and thermals right out of the box with no extra setup," adding that integrating LLM Observability allows teams to trace model latency spikes straight to underlying GPU metrics [2].

Growing Market for AI Observability Solutions

Datadog isn't alone in extending observability deeper into the AI stack. Grafana recently launched observability tools for AI with insights into agent behavior, while Grafana Cloud offers GPU observability covering hardware utilization and resource allocation for cost optimization [1]. Nutanix unveiled a multi-tenancy framework to run more workloads on existing GPUs with better insight into token consumption [1]. As organizations struggle to determine whether they're getting value from massive AI investments, tools that reduce wasted spending while maintaining performance are becoming essential for managing the economics of AI at scale [1].
