3 Sources
[1]
Google launches 'implicit caching' to make accessing its latest AI models cheaper | TechCrunch
Google is rolling out a feature in its Gemini API that the company claims will make its latest AI models cheaper for third-party developers. Google calls the feature "implicit caching" and says it can deliver 75% savings on "repetitive context" passed to models via the Gemini API. It supports Google's Gemini 2.5 Pro and 2.5 Flash models. That's likely to be welcome news to developers as the cost of using frontier models continues to grow.

Caching, a widely adopted practice in the AI industry, reuses frequently accessed or pre-computed data from models to cut down on computing requirements and cost. For example, caches can store answers to questions users often ask of a model, eliminating the need for the model to recreate answers to the same request.

Google previously offered model prompt caching, but only explicit prompt caching, meaning devs had to define their highest-frequency prompts. While cost savings are supposed to be guaranteed, explicit prompt caching often involved a lot of manual work. Some developers weren't pleased with how Google's explicit caching implementation worked for Gemini 2.5 Pro specifically, which they said caused surprisingly large API bills. Complaints reached a fever pitch in the past week, prompting the Gemini team to apologize and pledge to make changes.

In contrast to explicit caching, implicit caching is automatic. Enabled by default for Gemini 2.5 models, it passes on cost savings if a Gemini API request to a model hits a cache. "[W]hen you send a request to one of the Gemini 2.5 models, if the request shares a common prefix as one of previous requests, then it's eligible for a cache hit," explained Google in a blog post. "We will dynamically pass cost savings back to you."

The minimum prompt token count for implicit caching is 1,024 for 2.5 Flash and 2,048 for 2.5 Pro, according to Google's developer documentation, which is not a terribly big amount, meaning it shouldn't take much to trigger these automatic savings. Tokens are the raw bits of data models work with, with a thousand tokens equivalent to about 750 words.

Given that Google's previous claims of cost savings from caching fell short for some developers, there are some buyer-beware areas in these new claims. For one, Google recommends that developers keep repetitive context at the beginning of requests to increase the chances of implicit cache hits; context that might change from request to request should be appended at the end. For another, Google didn't offer any third-party verification that the new implicit caching system would deliver the promised automatic savings. So we'll have to see what early adopters say.
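To make the prefix recommendation concrete, here is a minimal sketch using the google-genai Python SDK. The file name and the `ask` helper are hypothetical; the point is simply that the unchanging material leads the request and the per-request question trails it, so consecutive requests share a common prefix.

```python
# pip install google-genai
from google import genai

client = genai.Client()  # reads the API key from the environment

# Hypothetical stable context; to be cacheable it must exceed the minimum
# token count (1,024 tokens for 2.5 Flash, 2,048 for 2.5 Pro).
REFERENCE_DOC = open("reference_doc.txt").read()

def ask(question: str) -> str:
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        # Stable context first, variable question last, so consecutive
        # requests share a common prefix and are eligible for cache hits.
        contents=[REFERENCE_DOC, question],
    )
    return response.text

print(ask("Summarize section 2."))
print(ask("List the key dates."))  # same prefix: eligible for a cache hit
```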
[2]
Implicit caching aims to slash Gemini API costs by 75%
Google has launched a new feature in its Gemini API called "implicit caching," which the company claims can reduce costs by 75% for third-party developers using its latest AI models, Gemini 2.5 Pro and 2.5 Flash. The feature automatically enables cost savings when a Gemini API request to a model hits a cache, eliminating the need for the manual configuration required by the previous explicit caching method. According to Google, implicit caching is triggered when a request shares a common prefix with a previous request, and the minimum prompt token count required is 1,024 for 2.5 Flash and 2,048 for 2.5 Pro.

Logan Kilpatrick, a member of the Gemini team, announced the launch on May 8, 2025, stating that the feature can deliver significant cost savings for developers. Google recommends that developers place repetitive context at the beginning of requests and append changing context at the end to increase the chances of implicit cache hits. Caching is a widely adopted practice in the AI industry that reuses frequently accessed or pre-computed data to cut down on computing requirements and costs.

Google's previous explicit caching method required developers to define high-frequency prompts manually, which often resulted in extra work and sometimes surprisingly large API bills for some users. Some developers had expressed dissatisfaction with the explicit caching implementation for Gemini 2.5 Pro, prompting the Gemini team to apologize and pledge to make changes. The new implicit caching feature addresses these concerns by automating the caching process and passing on cost savings to developers when a cache hit occurs. While Google claims that implicit caching can deliver 75% cost savings, the company did not provide third-party verification of the feature's effectiveness. As such, the actual cost savings may vary depending on how developers use the feature.
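Since the discount only applies once a prompt clears the minimum token counts, a quick preflight check can tell you whether a given context is even eligible. A sketch, assuming the google-genai SDK's count_tokens call and a hypothetical shared-context file:

```python
from google import genai

client = genai.Client()

# Minimum prompt sizes for implicit caching, per Google's documentation.
MIN_TOKENS = {"gemini-2.5-flash": 1024, "gemini-2.5-pro": 2048}

def clears_threshold(model: str, context: str) -> bool:
    """True if the shared context alone meets the model's minimum."""
    result = client.models.count_tokens(model=model, contents=context)
    return result.total_tokens >= MIN_TOKENS[model]

context = open("shared_context.txt").read()  # hypothetical stable prefix
if not clears_threshold("gemini-2.5-pro", context):
    print("Prefix is under 2,048 tokens; implicit caching will not trigger.")
```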
[3]
How to Cut AI Model Costs by 75% with Gemini AI's Implicit Caching
What if you could slash your AI model costs by a staggering 75% without sacrificing performance or efficiency? For many businesses and developers, the rising expense of running advanced AI models has become a significant hurdle, especially when handling repetitive tasks or processing large-scale data. With Gemini AI's latest innovation, implicit caching, this challenge is being turned on its head: the system automatically identifies redundant inputs and applies discounts without requiring you to lift a finger.

In this perspective, Sam Witteveen explores how implicit caching works, why it's exclusive to Gemini AI's 2.5 reasoning models, and how it can transform the way you approach AI-driven projects. From understanding token thresholds to using reusable content in your prompts, you'll find practical strategies to optimize your workflows and reduce expenses. Whether you're managing repetitive queries, analyzing extensive datasets, or seeking long-term solutions for static data, this feature offers a seamless path to efficiency.

Implicit caching is exclusive to Gemini AI's 2.5 reasoning models, including the Flash and Pro variants. It identifies repeated prefixes in your prompts and applies discounts automatically, streamlining workflows without requiring user intervention. This makes it particularly effective for tasks involving repetitive queries or foundational data: if your project frequently queries the same base information, implicit caching detects this redundancy and applies a 75% discount on those token costs. To activate the feature, however, your prompts must meet specific token thresholds:

- Gemini 2.5 Flash: at least 1,024 prompt tokens
- Gemini 2.5 Pro: at least 2,048 prompt tokens

These thresholds ensure that the system can efficiently process and cache repeated content, making it especially beneficial for high-volume tasks where cost savings are critical.

While implicit caching is ideal for dynamic and repetitive queries, explicit caching remains a valuable tool for projects that require long-term storage of static data. Unlike implicit caching, explicit caching involves manual setup, allowing users to store and retrieve predefined datasets as needed. For instance, if you're analyzing a fixed set of documents over an extended period, explicit caching ensures consistent access to this data without incurring additional token costs, though the manual configuration requires more effort than the automated implicit route. Explicit caching is particularly useful for projects where data consistency and long-term accessibility are priorities.

Efficient use of context windows is another key strategy for reducing costs with Gemini AI. By placing reusable content at the beginning of your prompts, you enable the system to recognize and cache it effectively. This approach not only minimizes token usage but also enhances the overall efficiency of your queries. Gemini AI's 2.5 models are specifically optimized to handle large context windows, making them well-suited for tasks involving substantial inputs such as documents or videos.
However, it's important to note that while text and video inputs are supported, YouTube videos are currently excluded from caching capabilities. Testing your specific use case is essential to ensure compatibility and to fully use the system's capabilities.

To maximize savings and optimize workflows with Gemini AI, consider implementing the following strategies:

- Place reusable, static content at the beginning of your prompts so the system can recognize and cache it.
- Append content that changes between requests at the end of the prompt.
- Make sure prompts meet the minimum token thresholds for your chosen model.
- Use explicit caching for static datasets you will query over a long period (see the sketch after this article).
- Test your workloads to confirm that cache hits, and the associated discounts, are actually occurring.

By adopting these practices, you can significantly reduce API costs while maintaining high levels of performance and efficiency in your AI-driven projects.

While implicit caching offers substantial benefits, it is important to understand its limitations. The feature is exclusive to Gemini AI's 2.5 reasoning models and is not available for earlier versions. Additionally, YouTube video caching is not supported, which may limit its applicability for certain multimedia projects. To address these limitations, evaluate your specific project requirements and test the caching functionality before fully integrating it into your workflows. Refining your prompt design and using the models' ability to handle large-scale inputs can help you work around these constraints and maximize the potential of implicit caching.

Gemini AI's implicit caching feature for its 2.5 reasoning models represents a significant step forward in cost optimization. By automatically applying discounts for repeated prompt prefixes, this functionality simplifies token management and delivers substantial savings. Whether you're processing repetitive queries, analyzing large documents, or working with video inputs, these updates provide a practical and efficient way to reduce expenses. With strategic implementation and careful planning, you can cut your AI model costs by up to 75%, making Gemini AI a more accessible and cost-effective tool for a wide range of projects.
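For the long-lived static-data case described above, the explicit route looks roughly like the following, assuming the google-genai SDK's context-caching interface. The file names, system instruction, and TTL are placeholders for your own values.

```python
from google import genai
from google.genai import types

client = genai.Client()

# Hypothetical fixed document set to be analyzed over a long period.
documents = [open(p).read() for p in ("contract_a.txt", "contract_b.txt")]

# Explicit caching: store the static content once, under your control,
# with a time-to-live, instead of relying on automatic prefix detection.
cache = client.caches.create(
    model="gemini-2.5-flash",
    config=types.CreateCachedContentConfig(
        contents=documents,
        system_instruction="You analyze the cached contracts.",
        ttl="3600s",  # keep the cache alive for one hour
    ),
)

# Later requests reference the cache by name instead of resending the text.
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Compare the termination clauses in the two contracts.",
    config=types.GenerateContentConfig(cached_content=cache.name),
)
print(response.text)
```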
Google launches 'implicit caching' for its Gemini API, aiming to reduce costs for developers using its latest AI models by up to 75%. This automatic feature is set to make accessing advanced AI models more affordable and efficient.
Google has introduced a feature called 'implicit caching' for its Gemini API, promising to cut costs by up to 75% for developers using its latest AI models. The change aims to make access to frontier models more affordable and efficient, and could change how developers budget for AI workloads [1].
Implicit caching is an automatic feature enabled by default for Gemini 2.5 models, including Gemini 2.5 Pro and 2.5 Flash. The system identifies repeated prefixes in API requests and applies discounts automatically, eliminating the need for manual configuration [2].
Key aspects of the feature include:

- It is enabled by default for Gemini 2.5 models, with no setup required.
- Savings of up to 75% apply to repetitive context passed via the Gemini API.
- A request becomes eligible for a cache hit when it shares a common prefix with a previous request.
- The minimum prompt size is 1,024 tokens for 2.5 Flash and 2,048 tokens for 2.5 Pro.
- Cost savings are passed back to developers dynamically when a hit occurs.
Previously, Google offered explicit prompt caching, which required developers to manually define high-frequency prompts. This method often involved substantial manual work and sometimes resulted in unexpectedly large API bills for some users [1].
Implicit caching addresses these issues by:

- Detecting repeated request prefixes automatically, with no manual prompt definitions.
- Being enabled by default for all Gemini 2.5 models.
- Passing cost savings back to developers whenever a request hits the cache.
To maximize the benefits of implicit caching, Google recommends:

- Keeping repetitive context at the beginning of requests.
- Appending context that may change from request to request at the end.
These strategies can help increase the chances of implicit cache hits and optimize overall efficiency.
While implicit caching offers significant advantages, it's important to note some limitations:

- The feature is exclusive to the Gemini 2.5 models and is not available for earlier versions.
- YouTube videos are currently excluded from caching, which may limit some multimedia projects.
- Prompts below the minimum token thresholds will not trigger the discount.
- Google has not provided third-party verification of the promised savings.
Developers are advised to test the feature with their specific use cases to ensure compatibility and maximize potential savings.
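Since no third-party verification exists yet, the most direct test is to issue two requests that share a long prefix and inspect the usage metadata, which reports how many prompt tokens were served from cache. A sketch, again assuming the google-genai SDK and a hypothetical shared-context file:

```python
from google import genai

client = genai.Client()
prefix = open("shared_context.txt").read()  # stable context above the minimum

for question in ("Summarize the document.", "What are the main risks?"):
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=[prefix, question],
    )
    usage = response.usage_metadata
    # cached_content_token_count reports prompt tokens billed at the
    # discounted cached rate; a nonzero value signals a cache hit.
    cached = usage.cached_content_token_count or 0
    print(f"{question!r}: {cached} of {usage.prompt_token_count} "
          "prompt tokens served from cache")
```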
The introduction of implicit caching could also have far-reaching effects on the AI industry.
As the cost of using frontier models continues to grow, features like implicit caching may play a crucial role in making AI technology more accessible and economically viable for a broader range of developers and businesses.