4 Sources
[1]
'Gemini 2.5 Computer Use' model enters preview with strong web, Android performance
Google is now letting developers preview the Gemini 2.5 Computer Use model behind Project Mariner and agentic features in AI Mode. This "specialized model" can interact with graphical user interfaces, specifically browsers and websites. The model works through several steps in a loop "until the task is complete." UI actions supported by the model include clicking, typing, going back/forward, searching the web, navigating to a specific URL, cursor hovering, keyboard combinations, scrolling, and drag/drop.

Google shared two examples (at 3x speed) with the following prompts:

"From https://tinyurl.com/pet-care-signup, get all details for any pet with a California residency and add them as a guest in my spa CRM at https://pet-luxe-spa.web.app/. Then, set up a follow up visit appointment with the specialist Anima Lavar for October 10th anytime after 8am. The reason for the visit is the same as their requested treatment."

"My art club brainstormed tasks ahead of our fair. The board is chaotic and I need your help organizing the tasks into some categories I created. Go to sticky-note-jam.web.app and ensure notes are clearly in the right sections. Drag them there if not."

Gemini 2.5 Computer Use is "primarily optimized for web browsers." However, Google has an "AndroidWorld" benchmark that "demonstrates strong promise for mobile UI control tasks," while the model is "not yet optimized for desktop OS-level control." Google demonstrated strong performance across web and mobile control benchmarks when compared to Claude and OpenAI's offering, as well as "leading quality for browser control at the lowest latency."

This model is built on Gemini 2.5 Pro's visual understanding and reasoning capabilities. Google says "versions of this model" power Project Mariner and AI Mode's agentic capabilities. It has been used internally for UI testing to speed up software development, while Google has an early access program for third-party developers building assistants and workflow automation tools. Gemini 2.5 Computer Use is available in public preview today through the Gemini API in Google AI Studio and Vertex AI.
[2]
Google's AI can now surf the web for you, click on buttons, and fill out forms with Gemini 2.5 Computer Use
Some of the largest providers of large language models (LLMs) have sought to move beyond multimodal chatbots -- extending their models out into "agents" that can actually take more actions on behalf of the user across websites. Recall OpenAI's ChatGPT Agent (formerly known as "Operator") and Anthropic's Computer Use, both released over the last two years. Now, Google is getting into that same game as well.

Today, the search giant's DeepMind AI lab subsidiary unveiled a new, fine-tuned and custom-trained version of its powerful Gemini 2.5 Pro LLM known as "Gemini 2.5 Computer Use," which can use a virtual browser to surf the web on your behalf, retrieve information, fill out forms, and even take actions on websites -- all from a user's single text prompt. "These are early days, but the model's ability to interact with the web - like scrolling, filling forms + navigating dropdowns - is an important next step in building general-purpose agents," said Google CEO Sundar Pichai, as part of a longer statement on the social network X.

The model is not available for consumers directly from Google, though. Instead, Google partnered with another company, Browserbase, founded by former Twilio engineer Paul Klein in early 2024, which offers virtual "headless" web browsers specifically for use by AI agents and applications. (A "headless" browser is one that doesn't require a graphical user interface, or GUI, to navigate the web, though in this case and others, Browserbase does show a graphical representation for the user.) Users can demo the new Gemini 2.5 Computer Use model directly on Browserbase and even compare it side-by-side with the older, rival offerings from OpenAI and Anthropic in a new "Browser Arena" launched by the startup (though only one additional model can be selected alongside Gemini at a time). For AI builders and developers, it's being made available as a raw, albeit proprietary, LLM through the Gemini API in Google AI Studio for rapid prototyping, and through Google Cloud's Vertex AI model selector and application-building platform.

The new offering builds on the capabilities of Gemini 2.5 Pro, released back in March 2025 but updated significantly several times since then, with a specific focus on enabling AI agents to perform direct interactions with user interfaces, including browsers and mobile applications. Overall, Gemini 2.5 Computer Use is designed to let developers create agents that can complete interface-driven tasks autonomously -- such as clicking, typing, scrolling, filling out forms, and navigating behind login screens. Rather than relying solely on APIs or structured inputs, this model allows AI systems to interact with software visually and functionally, much like a human would.

Brief User Hands-On Tests

In my brief, unscientific initial hands-on tests on the Browserbase website, Gemini 2.5 Computer Use successfully navigated to Taylor Swift's official website as instructed and provided me a summary of what was being sold or promoted at the top -- a special edition of her newest album, "The Life of A Showgirl." In another test, I asked Gemini 2.5 Computer Use to search Amazon for highly rated and well-reviewed solar lights I could stake into my back yard, and I was delighted to watch as it successfully completed a Google Search CAPTCHA designed to weed out non-human users ("Select all the boxes with a motorcycle.") It did so in a matter of seconds.
However, once it got through there, it stalled and was unable to complete the task, despite serving up a "task completed" message.

I should also note here that while the ChatGPT agent from OpenAI and Anthropic's Claude can create and edit local files -- such as PowerPoint presentations, spreadsheets, or text documents -- on the user's behalf, Gemini 2.5 Computer Use does not currently offer direct file system access or native file creation capabilities. Instead, it is designed to control and navigate web and mobile user interfaces through actions like clicking, typing, and scrolling. Its output is limited to suggested UI actions or chatbot-style text responses; any structured output, like a document or file, must be handled separately by the developer, often through custom code or third-party integrations.

Performance Benchmarks

Google says Gemini 2.5 Computer Use has demonstrated leading results in multiple interface control benchmarks, particularly when compared to other major AI systems, including Claude Sonnet and OpenAI's agent-based models. Evaluations were conducted via Browserbase and Google's own testing. Some highlights include:

* Online-Mind2Web (Browserbase): 65.7% for Gemini 2.5 vs. 61.0% (Claude Sonnet 4) and 44.3% (OpenAI Agent)
* WebVoyager (Browserbase): 79.9% for Gemini 2.5 vs. 69.4% (Claude Sonnet 4) and 61.0% (OpenAI Agent)
* AndroidWorld (DeepMind): 69.7% for Gemini 2.5 vs. 62.1% (Claude Sonnet 4); OpenAI's model could not be measured due to lack of access
* OSWorld: Currently not supported by Gemini 2.5; top competitor result was 61.4%

In addition to strong accuracy, Google reports that the model operates at lower latency than other browser control solutions -- a key factor in production use cases like UI automation and testing.

How It Works

Agents powered by the Computer Use model operate within an interaction loop. They receive:

* A user task prompt
* A screenshot of the interface
* A history of past actions

The model analyzes this input and produces a recommended UI action, such as clicking a button or typing into a field. If needed, it can request confirmation from the end user for riskier tasks, such as making a purchase. Once the action is executed, the interface state is updated and a new screenshot is sent back to the model. The loop continues until the task is completed or halted due to an error or a safety decision (see the code sketch after the adoption notes below). The model uses a specialized tool called 'computer_use', and it can be integrated into custom environments using tools like Playwright or via the Browserbase demo sandbox.

Use Cases and Adoption

According to Google, teams internally and externally have already started using the model across several domains:

* Google's payments platform team reports that Gemini 2.5 Computer Use successfully recovers over 60% of failed test executions, reducing a major source of engineering inefficiencies.
* Autotab, a third-party AI agent platform, said the model outperformed others on complex data parsing tasks, boosting performance by up to 18% in their hardest evaluations.
* Poke.com, a proactive AI assistant provider, noted that the Gemini model often operates 50% faster than competing solutions during interface interactions.

The model is also being used in Google's own product development efforts, including in Project Mariner, the Firebase Testing Agent, and AI Mode in Search.
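To make the "How It Works" loop above concrete, here is a minimal, illustrative Python sketch. Every name here (Action, model_step, execute, screenshot_fn) is a hypothetical placeholder, not part of the Gemini API; the model call, executor, and screenshot capture are passed in as callables because their concrete forms depend on the developer's stack.

```python
from dataclasses import dataclass, field

@dataclass
class Action:
    name: str                                  # e.g. "click", "type", "done"
    args: dict = field(default_factory=dict)   # action parameters from the model
    requires_confirmation: bool = False        # set for riskier steps (e.g. purchases)

def run_agent(task, model_step, execute, screenshot_fn, max_steps=50):
    """Drive the loop: task + screenshot + history -> action -> execute -> repeat."""
    history: list[Action] = []
    shot = screenshot_fn()                         # initial state of the interface
    for _ in range(max_steps):
        action = model_step(task, shot, history)   # one call to the model
        if action.name == "done":                  # model reports the task complete
            return history
        if action.requires_confirmation:
            # Per Google's safety description, flagged actions need a human
            # decision before the client executes them.
            if input(f"Allow '{action.name}'? [y/N] ").strip().lower() != "y":
                return history
        execute(action)                            # client-side code acts on the UI
        history.append(action)
        shot = screenshot_fn()                     # fresh screenshot for the next turn
    return history
```

The step cap is a simple guard mirroring the article's note that the loop halts on completion, an error, or a safety decision.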
Safety Measures

Because this model directly controls software interfaces, Google emphasizes a multi-layered approach to safety:

* A per-step safety service inspects every proposed action before execution.
* Developers can define system-level instructions to block or require confirmation for specific actions.
* The model includes built-in safeguards to avoid actions that might compromise security or violate Google's prohibited use policies.

For example, if the model encounters a CAPTCHA, it will generate an action to click the checkbox but flag it as requiring user confirmation, ensuring the system does not proceed without human oversight.

Technical Capabilities

The model supports a wide array of built-in UI actions:

* Clicking, typing, scrolling, navigating, hovering, keyboard combinations, and drag and drop, among others
* User-defined functions can be added to extend its reach to mobile or custom environments
* Screen coordinates are normalized (0-1000 scale) and translated back to pixel dimensions during execution

It accepts image and text input and outputs text responses or function calls to perform tasks. The recommended screen resolution for optimal results is 1440x900, though it can work with other sizes.

API Pricing Remains Almost Identical to Gemini 2.5 Pro

The pricing for Gemini 2.5 Computer Use aligns closely with the standard Gemini 2.5 Pro model. Both follow the same per-token billing structure: input tokens are priced at $1.25 per one million tokens for prompts under 200,000 tokens, and $2.50 per million tokens for prompts longer than that. Output tokens follow the same split, priced at $10.00 per million below that prompt threshold and $15.00 per million above it.

Where the models diverge is in availability and additional features. Gemini 2.5 Pro includes a free tier that allows developers to use the model at no cost, with no explicit token cap published, though usage may be subject to rate limits or quota constraints depending on the platform (e.g. Google AI Studio). This free access includes both input and output tokens. Once developers exceed their allotted quota or switch to the paid tier, standard per-token pricing applies. In contrast, Gemini 2.5 Computer Use is available exclusively through the paid tier: there is no free access currently offered for this model, and all usage incurs token-based charges from the outset.

Feature-wise, Gemini 2.5 Pro supports optional capabilities like context caching (starting at $0.31 per million tokens) and grounding with Google Search (free for up to 1,500 requests per day, then $35 per 1,000 additional requests). These are not available for Computer Use at this time. Another distinction is data handling: output from the Computer Use model is not used to improve Google products in the paid tier, while free-tier usage of Gemini 2.5 Pro contributes to model improvement unless explicitly opted out.

Overall, developers can expect similar token-based costs across both models, but they should consider tier access, included capabilities, and data use policies when deciding which model fits their needs.
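To make the per-token arithmetic above concrete, here is a small illustrative calculator. The rates and the 200,000-token threshold come from the pricing described above; keying both input and output rates on prompt length is an assumption based on that tiering, so treat this as a sketch rather than an official billing tool.

```python
PROMPT_THRESHOLD = 200_000   # tokens; prompts above this use the higher rates

def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Rough cost estimate from the per-million-token rates quoted above."""
    long_prompt = input_tokens > PROMPT_THRESHOLD
    input_rate = 2.50 if long_prompt else 1.25     # $ per 1M input tokens
    output_rate = 15.00 if long_prompt else 10.00  # $ per 1M output tokens
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000

# Example: a 50,000-token prompt with a 2,000-token response:
# 50,000 * $1.25/1M + 2,000 * $10/1M = $0.0625 + $0.02 = $0.0825
print(f"${estimate_cost_usd(50_000, 2_000):.4f}")  # -> $0.0825
```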
[3]
Introducing the Gemini 2.5 Computer Use model
Earlier this year, we mentioned that we're bringing computer use capabilities to developers via the Gemini API. Today, we are releasing the Gemini 2.5 Computer Use model, our new specialized model built on Gemini 2.5 Pro's visual understanding and reasoning capabilities, which powers agents capable of interacting with user interfaces (UIs). It outperforms leading alternatives on multiple web and mobile control benchmarks, all with lower latency. Developers can access these capabilities via the Gemini API in Google AI Studio and Vertex AI.

While AI models can interface with software through structured APIs, many digital tasks still require direct interaction with graphical user interfaces, for example, filling and submitting forms. To complete these tasks, agents must navigate web pages and applications just as humans do: by clicking, typing and scrolling. The ability to natively fill out forms, manipulate interactive elements like dropdowns and filters, and operate behind logins is a crucial next step in building powerful, general-purpose agents.

The model's core capabilities are exposed through the new 'computer_use' tool in the Gemini API and should be operated within a loop. Inputs to the tool are the user request, a screenshot of the environment, and a history of recent actions. The input can also specify whether to exclude functions from the full list of supported UI actions or specify additional custom functions to include.
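As a rough illustration of calling the 'computer_use' tool through the google-genai Python SDK, here is a hedged sketch of a single turn of that loop. The model ID and the ComputerUse/Environment configuration names are assumptions based on this announcement and may not match the shipped SDK exactly; the official documentation is authoritative.

```python
from google import genai
from google.genai import types

client = genai.Client()  # reads the GEMINI_API_KEY environment variable

# Assumed configuration names for the computer use tool; verify against the docs.
config = types.GenerateContentConfig(
    tools=[types.Tool(computer_use=types.ComputerUse(
        environment=types.Environment.ENVIRONMENT_BROWSER))],
)

with open("screenshot.png", "rb") as f:   # current state of the environment
    screenshot = f.read()

response = client.models.generate_content(
    model="gemini-2.5-computer-use-preview-10-2025",  # assumed preview model ID
    contents=[types.Content(role="user", parts=[
        types.Part(text="Fill in the signup form with the guest's details."),
        types.Part.from_bytes(data=screenshot, mime_type="image/png"),
    ])],
    config=config,
)

# The reply is typically a function call naming one of the supported UI actions;
# client code executes it and returns a new screenshot as the function response
# on the next turn of the loop.
for part in response.candidates[0].content.parts:
    if part.function_call:
        print(part.function_call.name, dict(part.function_call.args))
```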
[4]
Google's Gemini 2.5 Computer Use model can navigate the web like a human - SiliconANGLE
Google LLC has just announced a new version of its Gemini large language model that's able to navigate the web through a browser and interact with various websites, meaning it can perform tasks like searching for information or buying things without human supervision.

The model in question is called Gemini 2.5 Computer Use, and it uses a combination of visual understanding and reasoning to analyze users' requests and carry out tasks in the browser. It will complete all of the actions required to fulfill a task, such as clicking, typing, scrolling, manipulating dropdown menus, and filling out and submitting forms, just as a human would.

In a blog post, Google's DeepMind research outfit said Gemini 2.5 Computer Use is based on the Gemini 2.5 Pro LLM, and explained that earlier versions of the model have been used to power agentic features it has launched in tools such as AI Mode and Project Mariner. But this is the first time the complete model has been made available.

The company explained that each request kicks off a "loop" that involves the model going through various steps until the task is considered complete. First, the user sends a request to the model, which can also include screenshots of the website in question and a history of recent actions. Then, Gemini 2.5 Computer Use will analyze those inputs and generate a response, which will typically be a "function call representing one of the UI actions such as clicking or typing." Client-side code will then execute the required action, and after this is done, a new screenshot of the graphical user interface and the current website will be sent back to the model as a function response.

Google posted a few demonstration videos showing the computer use tool in action, noting that they are shown at 3x speed. The first video is based on the following prompt: "From https://tinyurl.com/pet-care-signup, get all details for any pet with a California residency and add them as a guest in my spa CRM at https://pet-luxe-spa.web.app/. Then, set up a follow up visit appointment with the specialist Anima Lavar for October 10th anytime after 8am. The reason for the visit is the same as their requested treatment."

Google is somewhat late to the party here. Just yesterday, OpenAI revealed a number of new applications for ChatGPT, enhancing the capabilities of its ChatGPT Agent feature, which is designed to complete various tasks on users' behalf using a computer. Anthropic PBC released a version of its flagship Claude AI model with the ability to use a computer last year.

Not only is Google's computer use model late, but it's also not as comprehensive. Unlike OpenAI's and Anthropic's tools, it can only access a web browser, rather than the entire computer operating system. "It's not yet optimized for desktop OS-level control, and currently supports 13 actions," the company explained.

Still, DeepMind's researchers say their focus on getting Gemini 2.5 Computer Use to work specifically in web browsers has paid off in terms of its performance. They claim that it "outperforms leading alternatives on multiple web and mobile benchmarks," including Online-Mind2Web and WebVoyager. They noted that it's primarily optimized for web browsers and so it performs better in them, but even so, it still outperformed its peers on the AndroidWorld benchmark, which demonstrates "strong promise for mobile UI control tasks," the researchers said.
They also claimed that Gemini 2.5 Computer Use is superior in terms of browser control at the lowest latency, based on its performance on the Browserbase harness for Online-Mind2Web.

Here's a second example of the model in action, using a different prompt: "My art club brainstormed tasks ahead of our fair. The board is chaotic and I need your help organizing the tasks into some categories I created. Go to sticky-note-jam.web.app and ensure notes are clearly in the right sections. Drag them there if not."

DeepMind's researchers are making Gemini 2.5 Computer Use available to developers through Google AI Studio and Vertex AI, and pricing aligns pretty closely with the standard Gemini 2.5 Pro model. They follow the same token-based billing structure, with input tokens priced at $1.25 per one million tokens for prompts with under 200,000 tokens, rising to $2.50 per million tokens for longer prompts. Output tokens are priced similarly for both models, at $10 per million for shorter prompts and $15 per million for longer ones. The real difference is that while Gemini 2.5 Pro offers a free tier, Gemini 2.5 Computer Use does not, so users must pay from the outset to access it.
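The coverage above describes client-side code executing each function call and sending back a fresh screenshot, with coordinates normalized to a 0-1000 scale per source [2]. A hedged Playwright sketch of such an executor might look like the following; the action names ('click_at', 'type_text_at', 'navigate') and argument shapes are illustrative assumptions, not the model's confirmed schema.

```python
from playwright.sync_api import sync_playwright

VIEWPORT = {"width": 1440, "height": 900}   # recommended resolution per source [2]

def denormalize(coord: int, extent: int) -> float:
    """Map a 0-1000 normalized coordinate back to pixels."""
    return coord / 1000 * extent

def execute(page, name: str, args: dict) -> bytes:
    """Carry out one model-proposed action, then capture the next screenshot."""
    if name == "click_at":                          # illustrative action name
        page.mouse.click(denormalize(args["x"], VIEWPORT["width"]),
                         denormalize(args["y"], VIEWPORT["height"]))
    elif name == "type_text_at":
        page.mouse.click(denormalize(args["x"], VIEWPORT["width"]),
                         denormalize(args["y"], VIEWPORT["height"]))
        page.keyboard.type(args["text"])
    elif name == "navigate":
        page.goto(args["url"])
    return page.screenshot()   # sent back to the model as the function response

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page(viewport=VIEWPORT)
    page.goto("https://example.com")
    shot = execute(page, "click_at", {"x": 500, "y": 120})
    browser.close()
```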
Google introduces a new AI model capable of interacting with web interfaces, outperforming competitors in various benchmarks. This development marks a significant step towards more versatile and autonomous AI agents.
Google has unveiled its latest artificial intelligence breakthrough, the Gemini 2.5 Computer Use model, a specialized AI system capable of interacting with graphical user interfaces (GUIs) in a manner similar to human users [1][3]. This development marks a significant step towards creating more versatile and autonomous AI agents that can navigate web browsers and mobile applications.

The Gemini 2.5 Computer Use model is built on the foundation of Gemini 2.5 Pro's visual understanding and reasoning capabilities [3]. It can perform a wide range of actions within web interfaces, including clicking, typing, scrolling, hovering, using keyboard combinations, navigating to URLs, and dragging and dropping [1]. These capabilities allow the model to complete complex tasks autonomously, such as booking appointments, organizing information, and interacting with various web applications [1].

Google claims that Gemini 2.5 Computer Use outperforms leading alternatives on multiple web and mobile control benchmarks [2], including Online-Mind2Web (65.7% vs. 61.0% for Claude Sonnet 4 and 44.3% for OpenAI's agent) and WebVoyager (79.9% vs. 69.4% and 61.0%) [2]. The model also demonstrates strong performance in mobile UI control tasks, despite being primarily optimized for web browsers [1].

Developers can access Gemini 2.5 Computer Use through the Gemini API in Google AI Studio and Vertex AI [3]. The model is exposed through a new 'computer_use' tool in the API, which operates within a loop, taking inputs such as user requests, screenshots of the environment, and action history [3].
While Gemini 2.5 Computer Use represents a significant advancement, it does have some limitations: it is not yet optimized for desktop OS-level control [4], it currently supports only 13 UI actions [4], and it offers no direct file system access or native file creation, so structured outputs must be handled separately by developers [2].

The introduction of Gemini 2.5 Computer Use opens up new possibilities for AI-driven task automation and assistance. Potential applications include UI testing that speeds up software development, personal assistants, and workflow automation tools [1]. As AI continues to evolve, models like Gemini 2.5 Computer Use are likely to play an increasingly important role in bridging the gap between human-computer interaction and AI capabilities.