4 Sources
[1]
AI is starting to look a lot like the early days of cloud - and the real race is operational
Over the past two years, most of the noise around AI has focused on the model race - whose model is bigger, faster or scoring better on benchmarks. But as AI moves from pilots into the core of products and workflows, a familiar pattern from the early days of cloud is re‑emerging: systems are more programmable than ever, but they are also much harder to run. And that means we now know where the most important competition in AI is shifting: from who has the "best" model to who can operate AI reliably, efficiently, and safely at scale.

AI is now hitting operational limits, not model limits
When looking at real‑world telemetry from thousands of production systems, a clear picture starts to form. Nearly 1 in 20 AI requests fails once applications reach scale, and a majority of those failures now stem from capacity limits such as rate limits, quotas and concurrency caps, rather than from model bugs or poor accuracy. That is a very different story from the benchmark charts most teams used to obsess over.
The amount of data sent per request is also climbing. Across many production estates, median users have more than doubled their token usage, while heavy users have seen volumes grow several‑fold. That growth is both a symptom of more ambitious AI use cases and a direct driver of cost and IT infrastructure stress.
You can see the impact most clearly in what many teams now describe as GPU sprawl: fragmented fleets spread across clouds and on‑prem clusters. Some GPUs sit idle while others are consistently saturated, and there is very little correlation between where GPU hours are spent and where they create business value. The result is familiar to anyone who lived through the early adoption of cloud computing - runaway spend, unpredictable performance and capacity crises that appear out of nowhere.

How this is playing out in APAC
Across Asia‑Pacific, and especially in ASEAN, we're currently seeing structural pressures: AI adoption is accelerating, but operational maturity is uneven. Singapore is further along on governance and observability, driven in part by regulatory expectations and a more mature cloud landscape. Meanwhile, markets such as Indonesia, Malaysia and Thailand are moving very fast on deployment, often pushing AI into customer‑facing services while operational practices catch up.
As organizations across these markets roll out multi‑model and agent‑based architectures, they are running into reliability issues, limited visibility and inconsistent model performance. Token usage is increasing quickly, but optimization practices, such as prompt caching and context engineering, are underutilized. That gap between readiness and deployment is already creating operational and cost debt that will be harder to unwind later.

The four operational disciplines AI teams need
With the evolution of AI resembling the early days of cloud, the good news is that we can predict, at least a little, where things are headed. Now, the question AI leaders should be asking is this: which disciplines distinguish the teams that will cope best with this complexity? In my view, there are four that teams working with AI need to adopt to see sustainable success:

1. Establish visibility and attribution
You cannot operate what you cannot see, and AI is no exception. Teams need to see how GPU hours and tokens map to specific applications, teams and use cases, so they can connect that usage to latency, error rates and user impact. That makes it possible to separate business‑critical workloads from background noise, and provide clarity into which services are driving cost or consuming capacity. When usage is visible and attributable on a single view, decisions about where to optimize, protect capacity or dial back become much less emotional and much more data‑driven.

2. Enforce control and guardrails
Without guardrails, AI systems will consume as much capacity as you give them. Practical controls include rate limits and budget caps, along with safeguards on agent behavior to stop unbounded retries, loops and poorly bounded workflows from exhausting shared resources. These controls are about making consumption predictable and ensuring that one runaway experiment cannot impact core production services. Without this discipline, AI programs tend to hit economic limits long before they hit technical ones. You end up with impressive prototypes, but unsustainable unit economics.

3. Optimize GPU utilization before scaling supply
Most teams reach for more GPUs when what they really have is a utilization problem. GPU instances already account for a significant share of compute costs, and that proportion only grows as organizations push deeper into training and inference at scale. But idle or underutilized GPUs create the sense of a shortage even when there is headroom in the estate. In turn, many teams can see their overall GPU bill climbing, but cannot see which workloads are driving consumption, or pinpoint the steps needed to improve efficiency.
What we learned during the early days of cloud is that in these instances, overprovisioning becomes the safest default - but then spend balloons even when there is stranded capacity in the fleet. Treating GPU infrastructure as a first‑class system means tracking utilization so that teams can distinguish genuine capacity shortages from misallocation or fragmentation. Then, they can decide whether to free up capacity or truly add more supply.

4. Design for efficiency at the application layer
High AI costs and rates of failure come from how applications are put together, not from the models themselves. Inefficient patterns, poor routing across providers and unoptimized prompts all drive up token usage and increase the risk of timeouts, errors and inconsistent behavior. But with proper visibility into prompts, agents and tools in production, teams can see how requests actually flow through the system and tune for quality, latency and cost in a controlled way. That turns the application layer from a black box into a place where efficient engineering choices are deliberate, measurable and aligned with business outcomes.

What leaders should do in the new AI race
The early days of cloud taught us that programmability without operational discipline can be as much a liability as an advantage. AI is now at a similar inflection point: the winners will not just be those with access to the most powerful models, but those who treat AI as a long‑term engineering and operations capability.
A useful test for any organization is whether it can explain where AI spend goes, how agents behave in production and which workloads it would protect first if capacity were suddenly cut. If the honest answer is "I don't know yet", then the next phase of the AI journey is clear: stop chasing the next model release, and focus on building the operational foundations that will help you scale AI safely and sustainably.

This article was produced as part of TechRadar Pro Perspectives, our channel to feature the best and brightest minds in the technology industry today. The views expressed here are those of the author and are not necessarily those of TechRadarPro or Future plc. If you are interested in contributing find out more here: https://www.techradar.com/pro/perspectives-how-to-submit
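The controls named under "Enforce control and guardrails" above (a budget cap plus a stop on unbounded retries) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not any provider's API: `TokenBudget`, `RateLimited` and `call_with_guardrails` are hypothetical names, and the sketch assumes a rate-limited request is not billed, so the budget is charged once per logical call.

```python
import time
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised when a call would push usage past the budget cap."""


class RateLimited(Exception):
    """Hypothetical stand-in for a provider's rate-limit error."""


@dataclass
class TokenBudget:
    limit: int      # maximum tokens allowed in the budget period
    used: int = 0   # tokens charged so far

    def charge(self, tokens: int) -> None:
        # Refuse the charge up front instead of discovering overspend later.
        if self.used + tokens > self.limit:
            raise BudgetExceeded(f"{self.used + tokens} tokens > cap of {self.limit}")
        self.used += tokens


def call_with_guardrails(call, tokens, budget, max_attempts=3,
                         base_delay=0.5, sleep=time.sleep):
    """Charge the budget once, then retry `call` a bounded number of times."""
    budget.charge(tokens)
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimited:
            # Bounded retries: give up after max_attempts instead of looping.
            if attempt == max_attempts - 1:
                raise
            # Exponential backoff: base_delay, 2x, 4x, ...
            sleep(base_delay * 2 ** attempt)
```

Passing `sleep` as a parameter is only a convenience for testing the backoff schedule without waiting; in production the default `time.sleep` applies. A shared budget per team or application is one way to keep a runaway experiment from consuming capacity that core services depend on.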
[2]
How AI observability helps organizations move from experimentation to production
Managing multiple model complexities to scale AI systems safely and reliably
Enterprise AI has entered a new operational phase, moving rapidly from experimentation into production systems integrated into customer experiences, workflows, and software delivery pipelines. However, as organizations operationalize AI, they are also introducing new complexity around infrastructure, governance, debugging, capacity planning, and cost control.
This complexity introduces new operational risks. AI systems continuously evolve as prompts change, models are updated, agents become more autonomous, and infrastructure dependencies shift over time. Without end-to-end visibility across the full AI stack, issues related to reliability, latency, output quality, or cost efficiency can gradually slip into production unnoticed, resulting in what many teams refer to as "invisible drift."
As AI adoption scales, observability is becoming essential for helping engineering teams maintain operational control, reliability, and resilience in rapidly changing environments.

Multi-provider AI brings a new wave of platform engineering challenges
Organizations are increasingly adopting multi-model AI strategies rather than relying on a single provider. Recent research shows that more than 70 per cent of organizations now use three or more models in their production environments. This reflects a broader shift toward diversified model libraries, with teams selecting models based on specific workload requirements such as latency, reasoning ability, operational risk, and cost efficiency.
This shift is creating a new generation of platform engineering challenges. AI environments now span evolving ecosystems of models, agents, orchestration frameworks, APIs, vector databases and infrastructure layers. As coding agents accelerate development, organizations are generating more code, dependencies, and operational overhead than teams can realistically manage manually. At the same time, enterprises are accumulating significant LLM technical debt as they rapidly integrate new tools and frameworks. Tool sprawl, fragmented visibility, and constantly evolving AI architectures are making systems harder to govern, troubleshoot, optimize and secure.
This makes AI observability essential, providing centralized visibility into model behavior, prompts, latency, hallucinations, token usage, infrastructure performance, and operational bottlenecks across complex multi-model environments.

Scaling AI safely, reliably and at speed requires control
As organizations race to scale their AI initiatives, operational failures are becoming more visible. Recent analysis shows that two per cent of all LLM calls returned errors, with rate limit issues accounting for almost a third of these (equating to approximately 8.4 million rate limit errors in total). This highlights the operational strain on systems as AI adoption accelerates.
At the same time, pressure to remain competitive is pushing organizations to move projects into production before operational controls have fully matured. Scaling too quickly introduces significant reliability, resilience, and governance risks. Real-time observability across the AI stack gives engineering teams the visibility needed to move quickly while maintaining high performance standards.
AI agents are adding yet another layer of complexity. Adoption of agent frameworks has doubled in the past year, leading to increased "agent sprawl". These agents autonomously interact with multiple tools, systems, APIs, and datasets, making it harder for organizations to monitor behavior, diagnose faults, manage security risks, and maintain governance controls without deeper telemetry.
To manage this complexity, organizations need enterprise-grade observability that delivers end-to-end visibility across the AI stack (from development through to production). This includes visibility into prompts, model interactions, inference pipelines, infrastructure performance, latency, failures, and downstream dependencies. With comprehensive telemetry in place, teams can accelerate AI innovation while improving reliability, security, and operational controls at scale.

Four ways observability helps organizations scale AI more reliably
Organizations moving AI into production are increasingly treating observability as a foundational operational discipline, rather than simply a monitoring capability. Four practices are becoming particularly important as enterprises scale multi-model AI environments:

1. Managing multi-model environments more effectively
Teams are implementing gateways, routing layers, and evaluation frameworks that enhance their ability to select, assess, and manage multi-model environments effectively. These systems enable organizations to compare model behaviors, evaluate outputs, optimize workload placement, and enforce governance policies across various providers. AI observability provides the real-time data needed to support these decisions.

2. Reducing operational overhead and tech debt
Centralized visibility across prompts, models, inference pipelines, and infrastructure helps teams manage increasingly distributed environments. Observability reduces operational overhead and limits the accumulation of LLM technical debt as tools and frameworks evolve.

3. Improving agent reliability and preventing infrastructure failures
AI observability improves agent reliability and helps organizations eliminate failures caused by capacity constraints and infrastructure bottlenecks. Real-time monitoring of GPU utilization, throughput, latency, request failures, and workload behavior enables engineering teams to identify emerging scaling limitations before they impact production systems or user experiences.

4. Diagnosing faults and understanding agent behavior
Detailed tracing across prompts, workflows, APIs, orchestration layers, and infrastructure dependencies provides the operational context needed to investigate anomalies and identify root causes. This is critical for understanding how AI agents behave in real-world production environments.

Moving to a state of production-ready AI
Enterprise AI is now entering its operational era. As organizations move from experimentation to production, observability becomes the backbone for managing the growing complexity of multi-model architectures, autonomous agents, and distributed AI systems. Without deep visibility into how these systems operate in production, organizations risk increasing operational failures, accumulating technical debt, and allowing invisible drift to undermine performance, reliability and governance over time.
AI observability provides the control needed to scale AI safely and effectively. Visibility across models, prompts, infrastructure, agents, and workflows helps teams build more governable, resilient and cost-effective AI systems. Success in the next phase of AI adoption will depend on transforming experimental AI systems into disciplined production platforms that can be continuously evaluated, improved and trusted at scale.
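To make the "centralized visibility" idea above concrete, here is a small Python sketch that rolls per-call telemetry up by application or by model, yielding the kind of figures the article quotes (error rate, share of rate-limit errors, token usage). The `LLMCall` record and its field names are assumptions made for illustration; real telemetry schemas differ between tools.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class LLMCall:
    """One telemetry record per model call (hypothetical schema)."""
    app: str
    model: str
    tokens: int
    latency_ms: float
    error: Optional[str] = None  # e.g. "rate_limit" or "timeout"; None on success


def summarize(calls, key="app"):
    """Aggregate calls by the given attribute ("app" or "model")."""
    stats = defaultdict(lambda: {"calls": 0, "tokens": 0,
                                 "errors": 0, "rate_limit_errors": 0})
    for call in calls:
        bucket = stats[getattr(call, key)]
        bucket["calls"] += 1
        bucket["tokens"] += call.tokens
        if call.error is not None:
            bucket["errors"] += 1
            if call.error == "rate_limit":
                bucket["rate_limit_errors"] += 1
    # Derive the error rate once all calls have been counted.
    for bucket in stats.values():
        bucket["error_rate"] = bucket["errors"] / bucket["calls"]
    return dict(stats)
```

Grouping the same records by `"app"` and then by `"model"` gives two views of one dataset: which services consume capacity, and which providers are returning rate-limit errors.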
[3]
How to future-proof enterprise operations in the age of invisible AI
Future-proofing enterprise operations with invisible AI foundations
At SAP Sapphire in Orlando, Christian Klein put it plainly: "For the mission-critical processes of our customers, almost right just isn't good enough." It was the line that crystallized the Autonomous Enterprise vision, and it is also the line that should reframe how every operations leader thinks about AI for the next eighteen months.
Sapphire made one thing unambiguous. AI is becoming visible at the top of the stack. Joule (or your chosen equivalent) is being positioned as the new front door to enterprise software, with more than two hundred agents and over fifty assistants spanning finance, supply chain, procurement, HCM, and customer experience. Users will increasingly describe an outcome and let agents orchestrate the work across SAP and non-SAP systems.
That is the visible layer. There is also an invisible one, and it is the one that determines whether any of this actually delivers. Agents only behave as well as the operational substrate they run on. They need systems that are healthy, observable, and consistent enough to act on safely. They need clean process telemetry, automated remediation when things break, and governance that extends across the hybrid landscape most large enterprises actually run. Without that foundation, agentic AI does not reduce operational risk. It multiplies it.
This is what future-proofing now means. Less about adopting the latest model, more about building the operational layer underneath it so that agents become a source of measurable outcomes rather than a source of new incidents. The opportunity is significant for organizations that get this right. Two groups are forming. Those who have built the operational readiness to let agents execute, and those who will spend the next two years discovering they have not.

Pragmatism in ERP transformation
Enterprises are navigating significant transitions in their core systems, and the 2027 SAP ECC end-of-mainstream-maintenance deadline is the most visible forcing function. But the SAPinsider 2026 research surfaces a more interesting signal underneath it. AI readiness is now cited by 43% of organizations as the primary driver of their transformation investment, ranking above the deadline itself. The deadline creates urgency. AI readiness creates direction.
For many large enterprises, the preferred approach is not wholesale reinvention but incremental change. Brownfield migration has become a common starting point. It allows organizations to move existing systems to modern platforms while preserving established processes and minimizing disruption. In complex landscapes with extensive integrations and dependencies, that level of continuity is non-negotiable.
A brownfield approach also provides a structured path forward. It enables organizations to stabilize their core systems before introducing further innovation, including agentic AI. The transition to cloud ERP software plays a central role here. Managed, scalable environments establish the platform that supports both current operations and future capabilities, with continuous updates and easier integration of new services.
This foundation matters particularly for AI. As intelligent features become embedded within enterprise applications, cloud platforms provide the IT infrastructure needed to support them at scale. From advanced analytics to autonomous execution, AI capabilities are increasingly delivered as part of the platform rather than as separate tools.
During these transitions, most organizations operate in hybrid environments that combine on-premises and cloud systems. This state can persist for years, introducing complexity in governance, monitoring, and integration. Managing hybrid operations effectively requires clear definitions of roles and responsibilities, and an operational substrate that is observable, automatable, and consistent across the entire landscape. As legacy solutions reach the end of life, organizations are reassessing how they support operations in this mixed environment, and the bar is rising.

AI as invisible infrastructure, AI as visible interaction
The Sapphire announcements make clear that AI is now operating at two layers, and both have to work.
At the interaction layer, AI is becoming the front door. Joule Work, the Autonomous Suite, and the broader agentic stack are designed to let users interact with enterprise systems through conversation and outcomes rather than screens and clicks. This is the visible AI, and it is what most of the industry will spend the next year talking about.
At the execution layer, AI is also becoming part of the underlying infrastructure. It will show up in observability, in automated remediation, in capacity and performance management, in the operational disciplines that have always determined whether mission-critical systems actually behave. This is the invisible AI, and it is what determines whether the visible layer delivers.
Lacking context is the number one reason enterprise AI projects fail to deliver value. Operational data, process telemetry, and the live state of the landscape are a critical part of that context. Agents that act on stale, incomplete, or unobservable systems will produce confident answers that quietly create new failure modes. Agents that act on a well-instrumented, well-automated estate will deliver the outcomes Sapphire promised.
This is why operational readiness is emerging as the real differentiator. Two groups are forming. Those who have built the foundation that lets agents execute reliably, and those who have never closed the gap between AI ambition and operational reality. The divide is not driven by access to technology. AI capabilities are increasingly available across major platforms. The divide is driven by whether the operational layer is ready to absorb them.

Positioning for long-term resilience
For enterprise and technology leaders, the convergence of cloud transformation and agentic AI presents a clearer opportunity than at any previous point in the SAP cycle. The path forward is not defined by rapid disruption but by deliberate, sustained evolution.
Future-proofing now means building the foundation that lets continuous improvement happen safely. It involves modernising core systems, embracing incremental change, and ensuring that emerging capabilities, especially agentic ones, can be integrated into operations without expanding the risk surface.
As AI becomes embedded across both the interaction layer and the execution layer, success will depend on how well organisations have prepared for both. The goal is intelligent operations that deliver tangible business outcomes, with AI serving as the enabler at every level of the stack. Resilience, adaptability, and operational discipline are the qualities that will define long-term competitiveness in the autonomous enterprise era.
[4]
Designing Observable and Reliable Generative AI Systems: A Platform Engineering Approach for Enterprise-Scale Deployment
Traditional monitoring can tell a company its servers are healthy and still leave it blind to why an AI agent made the wrong call. An engineer who builds the layer underneath these systems explains why observability and evaluation are becoming the foundation of trustworthy AI. An AI agent finishes its task. Every call it made returned cleanly, the logs are green, the dashboards show no errors. And the answer it gave is wrong. Nobody in the room can say why. That gap, between a system that looks healthy and a system that is actually doing its job, is the problem Ayush Jain has spent his career chasing into its hiding places. A software engineer who has worked on large-scale search, machine learning, and distributed systems, Jain belongs to a small group of people who build the unglamorous layer beneath enterprise AI: the pipelines, traces, and telemetry that let anyone see what an autonomous system is really doing. "Traditional monitoring can confirm that infrastructure is healthy," he says, "but it often fails to explain why an AI system made a particular decision or produced an unexpected outcome." For two decades, software told on itself. When a deterministic program broke, it left an error log, an exception, a failed health check. The failure had a fingerprint. Agentic systems do not work that way. They reason in probabilities, pick their own tools, pull in outside context, and plan across multiple steps, and they introduce entirely new ways to fail. "An agent may successfully execute API calls while still producing incorrect outcomes due to flawed reasoning, poor tool selection, or incomplete context," Jain says. The pipes all work. The judgment does not. And the standard monitoring stack, built to watch the pipes, never notices. Jain learned this at the scale where small problems become expensive ones. 
At Bloomberg, he contributed to search and ranking systems that supported hundreds of thousands of user interactions across more than one hundred million documents. A search result that drifts a few points in relevance does not throw an exception, but at that volume, it quietly degrades the experience for thousands of people. To catch it, he helped build observability pipelines that processed millions of telemetry events a day, turning the raw exhaust of a running system into real-time signal about search quality, user behavior, performance, and anomalies. The point was not to confirm the machines were up. The point was to close the distance between an idea that worked in an experiment and a system that held up in production. That distance has only grown as the work moved from search to agents. More recently, at Microsoft, Jain has focused on AI agent platform infrastructure, where the central difficulty is that failures are behavioral rather than infrastructural. An agent that calls every API correctly can still choose the wrong tool, reason poorly, or act on a thin slice of context and arrive somewhere it should not. So the platform has to watch behavior, not just uptime. The systems he has worked on capture execution traces, monitor how workflows actually resolve, collect telemetry about the agent's conduct, and let teams run the same task across different configurations to see which one holds. It is the difference between knowing a process finished and knowing it finished for the right reasons. The throughline across both chapters is a conviction that the industry is quietly reorganizing itself around. "I believe the industry is shifting from optimizing model performance to optimizing AI systems," Jain says. For years the scoreboard was the model: a higher benchmark, a better accuracy number. 
But a benchmark is a single moment, and an agent makes a sequence of decisions across many steps and external tools, where accuracy and precision stop describing the thing that matters. In his view, "agent behavior must be evaluated continuously rather than treated as a one-time model validation exercise." The metrics that count look less like a test score and more like an operations report: task completion, reasoning quality, tool effectiveness, cost, safety, alignment with what the business actually wanted. Jain frames the moment with an analogy his peers recognize. "Enterprise AI is entering a phase similar to what cloud computing experienced during the rise of Site Reliability Engineering," he says. Cloud services eventually stopped being judged only on whether they were fast and started being judged on whether they stayed up, measured in uptime and latency that everyone could see. He expects AI to follow the same arc, with a new vocabulary of behavioral measures: hallucination rates, reasoning consistency, retrieval effectiveness, policy adherence, workflow completion. Out of that shift, he believes, a discipline is forming. "I believe AI Reliability Engineering will become a foundational discipline," he says, and the companies that build it into their platforms early will hold a real advantage over the ones that bolt it on after something breaks. What he is describing is a change in where the hard work of AI lives. The attention has long gone to the model and the clever output. Jain's argument is that the durable problem sits one layer down, in whether anyone can explain, measure, and trust what the system did. "The challenge is no longer simply generating intelligent outputs," he says. "It is creating systems that make those outputs explainable, measurable, and trustworthy at scale." 
He expects the next generation of enterprise platforms to make observability and evaluation first-class parts of the architecture rather than afterthoughts, with online evaluation, execution traces, and feedback loops catching trouble before a user ever does. His closing position is plain. "Making AI systems measurable, explainable, and reliable is essential for successful enterprise adoption at scale," he says. The companies treating that as the real engineering challenge are the ones whose AI will still be trusted a year after the demo. The rest are flying on green dashboards, and learning the hard way that healthy is not the same as right.
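The gap Jain describes, a run in which every call succeeds and the answer is still wrong, becomes measurable once outcomes are scored separately from infrastructure health. The Python sketch below is one hedged way to do that, not a description of any platform he worked on: `Step`, `AgentRun` and `evaluate` are hypothetical names, a reference answer is assumed to exist for each task, and exact string match stands in for whatever quality check a real evaluation would use.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    tool: str
    ok: bool  # did the tool call itself succeed (infrastructure level)?


@dataclass
class AgentRun:
    task_id: str
    steps: List[Step]
    answer: str
    expected: str  # reference answer, assumed available for evaluation


def evaluate(runs):
    """Report infrastructure health and task outcomes as separate metrics."""
    if not runs:
        raise ValueError("no runs to evaluate")
    total = len(runs)
    # A run is "green" when every tool call in its trace succeeded.
    green = [all(step.ok for step in run.steps) for run in runs]
    # Exact match is a deliberate simplification of "the answer was right".
    correct = [run.answer == run.expected for run in runs]
    return {
        "infra_success_rate": sum(green) / total,
        "task_completion_rate": sum(correct) / total,
        # Green dashboards, wrong answer: invisible to uptime monitoring.
        "silent_failures": sum(g and not c for g, c in zip(green, correct)),
    }
```

Running the same task set through `evaluate` under different agent configurations gives a simple way to compare which one holds up, in the spirit of the continuous evaluation the article argues for.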
The AI industry is experiencing a fundamental shift reminiscent of the early days of cloud computing. As organizations move AI from experimentation into production, operational challenges are overtaking model performance as the primary concern. Nearly 1 in 20 AI requests fails at scale, with most failures stemming from capacity limits rather than model accuracy, revealing that the critical competition has moved from building the best model to operating AI reliably and efficiently.
The AI industry is experiencing a transformation that veterans of cloud computing will recognize immediately. After two years dominated by model benchmarks and performance competitions, AI operations are now facing the same operational complexity that defined early cloud adoption. Real-world telemetry from thousands of production systems reveals a stark reality: nearly 1 in 20 AI requests fails once applications reach scale, and the majority of these failures stem from capacity limits such as rate limits, quotas, and concurrency caps rather than model bugs or poor accuracy [1]. This shift marks a critical inflection point where the race has moved from who has the best model to who can operate AI reliably at scale.
Source: TechRadar
As organizations operationalize AI in production, they're introducing complexity around infrastructure, governance, debugging, capacity planning, and AI cost governance [2]. The amount of data sent per request is climbing dramatically, with median users more than doubling their token usage while heavy users see volumes grow several-fold [1]. This growth reflects more ambitious use cases and drives mounting infrastructure stress. The result manifests as GPU sprawl: fragmented fleets spread across clouds and on-premises clusters where some GPUs sit idle while others remain consistently saturated, with little correlation between where GPU hours are spent and where they create business value [1].
Research shows that more than 70 percent of organizations now use three or more models in their production environments, reflecting a shift toward diversified model libraries where teams select models based on specific workload requirements such as latency, reasoning ability, operational risk, and cost efficiency [2]. These multi-model environments create a new generation of platform engineering challenges. AI environments now span evolving ecosystems of models, agents, orchestration frameworks, APIs, vector databases, and infrastructure layers. Enterprises are accumulating significant technical debt as they rapidly integrate new tools and frameworks, with tool sprawl and fragmented visibility making systems harder to govern, troubleshoot, and secure [2].
Traditional monitoring can confirm that infrastructure is healthy but often fails to explain why an AI system made a particular decision or produced an unexpected outcome [4]. Generative AI systems introduce entirely new failure modes because they reason in probabilities, pick their own tools, and plan across multiple steps. An agent may successfully execute API calls while still producing incorrect outcomes due to flawed reasoning, poor tool selection, or incomplete context [4]. Without end-to-end AI observability, issues related to AI reliability, latency, output quality, or cost efficiency can gradually slip into production unnoticed, resulting in what teams call invisible drift [2].
Adoption of agent frameworks has doubled in the past year, leading to increased agent sprawl [2]. These agents autonomously interact with multiple tools, systems, APIs, and datasets, making it harder for organizations to monitor behavior, diagnose faults, manage security risks, and maintain governance controls without deeper telemetry. Analysis reveals that 2 percent of all LLM calls returned errors, with rate limit issues accounting for almost a third of these failures, equating to approximately 8.4 million rate limit errors in total [2]. This highlights the operational strain on systems as AI adoption accelerates.
The AI operational challenges facing teams today require four critical disciplines. First, establish visibility and attribution so GPU hours and token usage map to specific applications, teams, and use cases, connecting usage to latency, error rates, and user impact [1]. Second, enforce control and guardrails including rate limits and budget caps, along with safeguards on agent behavior to stop unbounded retries and loops from exhausting shared resources [1]. Third, optimize GPU utilization before scaling supply, as most teams reach for more GPUs when they actually have a utilization problem [1]. Fourth, manage multi-model environments more effectively through gateways, routing layers, and evaluation frameworks [2].
Experts believe enterprise AI is entering a phase similar to what cloud computing experienced during the rise of Site Reliability Engineering [4]. The industry is shifting from optimizing model performance to optimizing AI systems, where agent behavior must be evaluated continuously rather than treated as a one-time model validation exercise [4]. As SAP's Christian Klein emphasized at Sapphire, for mission-critical processes, almost right just isn't good enough [3]. Organizations that build operational readiness into their platforms early will hold a significant advantage as agentic AI becomes embedded in enterprise applications [3].
Summarized by Navi