Gemini 3 Flash's Agentic Vision uses code execution to inspect images step-by-step

Google DeepMind introduced Agentic Vision in Gemini 3 Flash, transforming static image analysis into an active investigative process. The model uses Python code execution to zoom, annotate, and manipulate images through a Think, Act, Observe loop. This approach delivers a 5-10% quality boost across vision benchmarks and is now available through the Gemini API in Google AI Studio and Vertex AI.

Google DeepMind Transforms Image Understanding with Agentic Vision

Google DeepMind has introduced Agentic Vision in Gemini 3 Flash, marking a significant shift in how AI models process visual information.[1][2] Unlike conventional frontier AI models that analyze images in a single, static glance, this new capability treats vision as an active investigative process.[4] When traditional models miss fine-grained details like serial numbers on microchips or distant street signs, they're forced to guess. Agentic Vision addresses this limitation by combining visual reasoning and code execution, allowing Gemini 3 Flash to formulate plans to zoom in, inspect, and manipulate images through step-by-step reasoning.[1]

Source: Geeky Gadgets

The technology leverages a Think, Act, Observe loop that fundamentally changes how the model interacts with visual data.[4] In the Think phase, Gemini 3 Flash analyzes the user query and creates a multi-step plan to extract relevant visual information. During the Act phase, the model generates and executes Python code to manipulate or analyze images, performing actions like cropping, rotating, and annotating images with bounding boxes.[4] The Observe phase then feeds the modified image back into the model's context window, allowing it to re-examine the updated visual data before producing a final answer.
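
Developers can trigger this loop by enabling the code execution tool in the Gemini API. Below is a minimal sketch using the google-genai Python SDK; the model ID string "gemini-3-flash" and the image file name are placeholders, and the exact identifiers may differ from what the API exposes.

```python
from google import genai
from google.genai import types

# Assumes GEMINI_API_KEY is set in the environment.
client = genai.Client()

with open("microchip.jpg", "rb") as f:  # placeholder input image
    image_bytes = f.read()

response = client.models.generate_content(
    model="gemini-3-flash",  # placeholder model ID; check the docs for the exact string
    contents=[
        types.Part.from_bytes(data=image_bytes, mime_type="image/jpeg"),
        "Read the serial number printed on this chip.",
    ],
    # Enabling the code execution tool lets the model run the Act phase:
    # it can write and execute Python to crop, zoom, or annotate the image.
    config=types.GenerateContentConfig(
        tools=[types.Tool(code_execution=types.ToolCodeExecution())],
    ),
)

# The response interleaves the model's reasoning, the code it ran,
# and the final grounded answer.
for part in response.candidates[0].content.parts:
    if part.text:
        print(part.text)
```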

Real-Time Python Code Execution Reduces Hallucinations

One of the most compelling aspects of code-based image analysis is its ability to replace probabilistic guessing with verifiable execution.[1] Standard language models often hallucinate during multi-step visual math tasks. Gemini 3 Flash bypasses this by offloading computation to a deterministic Python environment, grounding responses in visual evidence rather than uncertain estimates.[1]

In a practical demonstration of object counting within the Gemini app, when asked to count digits on a hand, the model executes code to draw directly on the canvas.[1] This "visual scratchpad" approach uses Python to create bounding boxes and numeric labels over each identified finger, ensuring the final answer is based on pixel-perfect understanding.[4] Real-time Python code execution also enables visual math capabilities, allowing the model to parse high-density tables and visualize findings through code.
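
The scratchpad itself is ordinary image-manipulation Python. A sketch of what such generated code might look like, using Pillow to draw numbered bounding boxes; the coordinates and file names here are purely illustrative:

```python
from PIL import Image, ImageDraw

# Illustrative box coordinates (x0, y0, x1, y1) for each detected finger;
# in practice the model derives these from its own detections.
boxes = [
    (120, 40, 180, 260),
    (200, 20, 260, 250),
    (280, 30, 340, 255),
    (360, 50, 420, 265),
    (440, 140, 520, 300),
]

img = Image.open("hand.jpg").convert("RGB")  # placeholder input image
draw = ImageDraw.Draw(img)

# Draw a numbered box over each detection so the count is visually verifiable.
for i, (x0, y0, x1, y1) in enumerate(boxes, start=1):
    draw.rectangle((x0, y0, x1, y1), outline="red", width=3)
    draw.text((x0, y0 - 14), str(i), fill="red")

img.save("hand_annotated.jpg")
print(f"Counted {len(boxes)} fingers")
```

Because the count comes from `len(boxes)` rather than a token-by-token estimate, the final answer is a deterministic readout of what the code actually found.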

Measurable Performance Gains Across Vision Benchmarks

Enabling code execution with Gemini 3 Flash delivers a consistent 5-10% quality boost across most vision benchmarks.[2] This improvement translates to fewer errors in real-world applications. PlanCheckSolver.com, an AI-based building plan validation platform, reported a 5% accuracy improvement after implementing the technology.[4] The system uses iterative refinement to crop and analyze high-resolution sections of building plans, examining roof edges and structural components by appending each cropped image back into the model's context to verify compliance with building codes.
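
One way to implement that crop-and-reinspect loop from the client side is sketched below, assuming the regions of interest are already known; the model ID, file name, and pixel coordinates are hypothetical.

```python
import io

from PIL import Image
from google import genai
from google.genai import types

client = genai.Client()  # assumes GEMINI_API_KEY is set
chat = client.chats.create(
    model="gemini-3-flash",  # placeholder model ID
    config=types.GenerateContentConfig(
        tools=[types.Tool(code_execution=types.ToolCodeExecution())],
    ),
)

plan = Image.open("building_plan.png")  # placeholder high-resolution plan

# Hypothetical regions of interest (x0, y0, x1, y1) in pixel coordinates,
# e.g. roof edges flagged in an earlier pass.
regions = [(0, 0, 2000, 1500), (6000, 0, 8000, 1500)]

for i, box in enumerate(regions, start=1):
    crop = plan.crop(box)  # full-resolution section of the plan
    buf = io.BytesIO()
    crop.save(buf, format="PNG")
    # Append the crop back into the conversation so the model can
    # re-examine this section at native resolution.
    reply = chat.send_message([
        types.Part.from_bytes(data=buf.getvalue(), mime_type="image/png"),
        f"Section {i}: check the roof edge details against the code requirements.",
    ])
    print(reply.text)
```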

The model's ability to dynamically interact with visual data extends to automatic zooming when fine-grained details are detected.[1] This dynamic image manipulation happens implicitly, without requiring explicit prompt nudges to trigger the behavior.[4] In a demonstration from Google AI Studio, Gemini 3 Flash identified data from a visual table, generated Python code to normalize values relative to prior state-of-the-art results, and produced a bar chart using Matplotlib.[4]
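
The code generated in that demonstration would resemble the following sketch; the benchmark names and scores below are invented placeholders rather than the actual figures shown in the demo.

```python
import matplotlib.pyplot as plt

# Placeholder values standing in for scores read off a visual table.
benchmarks = ["DocVQA", "ChartQA", "TextVQA"]
new_scores = [91.0, 84.5, 79.2]   # hypothetical new results
prior_sota = [88.0, 81.0, 77.5]   # hypothetical prior state of the art

# Normalize each score relative to the prior state of the art.
relative = [new / old for new, old in zip(new_scores, prior_sota)]

plt.bar(benchmarks, relative)
plt.axhline(1.0, color="gray", linestyle="--", label="prior SOTA")
plt.ylabel("Score relative to prior SOTA")
plt.legend()
plt.savefig("relative_scores.png")
```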

Availability and Future Expansion Plans

Agentic Vision is rolling out to the Gemini app with the Thinking model and is available today for developers through the Gemini API in Google AI Studio and Vertex AI.[1][4] Google DeepMind outlined several planned expansions for the capability. While the model currently performs implicit zooming, other actions like rotation and visual math require explicit prompts but are expected to become automatic in future updates.[4]

Future tools will allow Gemini to use web search and reverse image search to ground its understanding of the world even further.[1][4] The technology is also slated to expand beyond Gemini 3 Flash to other Gemini model sizes, suggesting broader adoption across Google's AI ecosystem.[4] This evolution positions AI vision systems to move from passive observation to active collaboration, addressing complex challenges across industries from logistics and engineering to research and urban planning.[3]
