2 Sources
[1]
Google gives enterprises new controls to manage AI inference costs and reliability
Flex and Priority tiers let Gemini API developers route workloads by criticality through a single interface -- but they may not always get what they ask for. Google has added two new service tiers to the Gemini API that enable enterprise developers to control the cost and reliability of AI inference depending on how time-sensitive a given workload is. While the cost of training large language models for artificial intelligence has been a concern in the past, the focus of attention is increasingly moving to inferencing, or the cost of using those models. The new tiers, called Flex Inference and Priority Inference, address a problem that has grown more acute as enterprises move beyond simple AI chatbots into complex, multi-step agentic workflows, the company said in a blog post published Thursday.
[2]
New ways to balance cost and reliability in the Gemini API
Today, we are adding two new service tiers to the Gemini API: Flex and Priority. These new options give you granular control over cost and reliability through a single, unified interface. As AI evolves from simple chat into complex, autonomous agents, developers typically have to manage two distinct types of logic. Until now, supporting both meant splitting your architecture between standard synchronous serving and the asynchronous Batch API. Flex and Priority help to bridge this gap. You can now route background jobs to Flex and interactive jobs to Priority, both using standard synchronous endpoints. This eliminates the complexity of async job management while giving you the economic and performance benefits of specialized tiers. Flex Inference is our new cost-optimized tier, designed for latency-tolerant workloads without the overhead of batch processing.
Google has introduced two new service tiers to the Gemini API—Flex Inference and Priority Inference—giving enterprise developers granular control over cost and reliability for AI workloads. The update addresses growing concerns about AI inference expenses as companies move beyond simple chatbots into complex, multi-step agentic workflows that require different performance levels.
Google has unveiled two new service tiers for the Gemini API, fundamentally changing how enterprise developers manage AI inference costs and reliability [2]. The additions, called Flex Inference and Priority Inference, provide granular control over cost and reliability through a single, unified interface, eliminating the need for developers to split their architecture between different serving methods [2].
The new tiers let enterprise developers route workloads by criticality, addressing a problem that has intensified as enterprises move beyond simple AI chatbots into complex, multi-step agentic workflows [1]. While training large language models has historically been a major expense, attention is increasingly shifting to inferencing, the cost of actually using those models in production environments [1].

Flex Inference serves as Google's cost-optimized tier, specifically designed for latency-tolerant workloads without the overhead of batch processing [2]. This tier is ideal for background jobs that don't require immediate responses, allowing developers to significantly reduce expenses on non-critical AI workloads. Priority Inference, on the other hand, caters to interactive jobs that demand immediate processing and consistent performance. Together, these tiers enable developers to route different types of logic through standard synchronous endpoints, rather than managing the complexity of asynchronous job management systems.
As AI evolves from simple chat into complex, autonomous agents, developers typically need to manage two distinct types of logic with different performance requirements [2]. Until now, supporting both meant splitting architecture between standard synchronous serving and the asynchronous Batch API, creating operational complexity for enterprises [2].

The new service tiers bridge this gap by allowing developers to maintain a unified API approach while still achieving the economic and performance benefits of specialized tiers. This matters because it simplifies infrastructure management while giving enterprises the flexibility to optimize spending based on actual business needs rather than technical constraints.
For enterprises scaling AI deployments, these new tiers offer a practical way to control expenses without sacrificing performance where it counts. Companies can now allocate their AI budgets more strategically, directing premium resources to customer-facing interactions while using cost-effective options for data processing, content generation, and other background tasks.
The timing is significant as organizations grapple with the reality that AI inference costs can quickly spiral as usage scales. By providing this level of control over cost and reliability, Google positions the Gemini API as a more economically sustainable option for long-term enterprise AI strategies. Developers should watch how pricing structures evolve and whether competitors introduce similar tiered approaches to manage AI inference costs across different workload types.
Summarized by Navi