Google Gemini Omni turns images, audio, and text into video with digital avatars

Google unveiled Gemini Omni at I/O 2026, a multimodal AI model that can generate video from any combination of images, audio, video, and text. The AI video tool lets users create digital avatars that look and sound like them, edit videos through natural conversation, and modify real footage. All outputs include SynthID watermarks to combat deepfakes.

Google Gemini Omni Advances Multimodal AI Capabilities

At Google I/O 2026, the tech giant took a concrete step toward its vision of true multimodal AI with the launch of Google Gemini Omni, a new family of models that CEO Sundar Pichai says can "create anything from any input." [1]

The announcement marks three years since Google first introduced Gemini with the goal of building a single neural network trained on text, image, audio, and video that could generate content in any format. Starting with video generation, Omni represents what Google calls "the next step" toward building AI that can model and simulate the real world. [2]

Source: ZDNet

Unlike Google's existing video model Veo, which lets users turn text and images into videos, Google Gemini Omni reasons across multiple input types simultaneously rather than simply stitching them together. Users can combine images, audio, video, and text to produce consistent, high-quality outputs. [1]

Nicole Brichtova, DeepMind's director of product management, emphasized that this release is more than a Veo update: "It's the next step towards the progression of combining the intelligence of Gemini with the rendering capabilities of our media models." [1]

Advanced Video Generation Grounded in Real-World Knowledge

The multimodal AI models demonstrate an improved understanding of physics, modeling phenomena such as gravity, kinetic energy, and fluid dynamics to create realistic video outputs. [4]

Gemini Omni also leverages knowledge of history, science, and cultural context to bridge the gap from photorealism to meaningful storytelling. When given a simple prompt like "a claymation explainer of protein folding," the AI video tool quickly rendered a stop-motion video with a voice-over explaining how proteins fold into alpha helices and beta sheets. [1]

Source: CNET

Users can generate video from text, generate video from images, or upload their own footage and transform it entirely. "Your video becomes a starting point for something you never could have filmed yourself," Google explained. [4]

The ability to transform real-life footage into AI-generated content represents a significant leap beyond previous capabilities. [5]

Edit Videos Through Natural Conversation and Digital Avatars

One of the standout features allows users to edit videos through natural conversation, with each instruction building on the last to keep characters and elements consistent. [4]

Users can change environments, angles, styles, or specific details with plain text commands rather than complex editing software. The multimodal AI models also enable users to create a digital avatar that looks and sounds like them by recording themselves speaking a series of numbers during onboarding. [1]

The video cloning capability has sparked both intrigue and concern. Content creators could use digital avatars for personalized videos or when they're having "a bad hair day, a bad voice day, or a bad attitude day," as one observer noted. [3]

However, Brichtova and research engineer Gabe Barth-Maron positioned the feature more as a consumer tool for creating "personalized memes" than as a means of producing professional content. [1]

SynthID Watermarks Address Deepfake Concerns

To combat potential misuse and deepfakes, Google has implemented several safeguards. All videos created with Omni will include Google's SynthID digital watermark, an imperceptible fingerprint that allows users to verify whether videos were generated via Gemini products. [1] [4]

The watermarks are particularly important given Omni's ability to fully replace elements in videos, which could lead to concerning outcomes. [2]

Google stated it has "clear policies to protect users from harm and governing the use of our AI tools." [4]

The ability to edit a video's audio and speech is still in testing, as the company works to bring it to users "responsibly." [3]

Gemini Omni Flash Rollout and Future Implications

The first model in the family, Gemini Omni Flash, is rolling out today to the Gemini app, YouTube Shorts, and AI creative studio Google Flow. [1]

Flash can render 10 seconds of video, a limitation based on user expectations rather than technical constraints, with longer durations planned for the near future. [1]

The model is available to all Google AI Plus, Pro, and Ultra subscribers globally. [4]

Source: Lifehacker

While Google is positioning Omni Flash as a consumer tool, the enterprise and creative implications are clear. The company will make Omni available via API in the coming weeks for developers and enterprise customers. [2]

Brichtova highlighted the model's text-rendering capabilities as particularly useful for advertising applications. [1]

An end-to-end multimodal workflow could transform how advertisers and filmmakers work, though the technology's reception will depend on whether it can overcome the "uncanny valley" look that has plagued AI-driven video generation. [4]

The long-term vision extends beyond video to generating images from audio or audio from video, positioning Gemini Omni as a step toward AI that can simulate reality rather than just predict text. [1]

A more powerful Gemini Omni Pro version is in development for future release. [2]

© 2026 TheOutpost.AI. All rights reserved.