Google introduces Gemini Omni: multimodal video generation from any input type
At Google I/O 2026, Google introduced Gemini Omni, a new multimodal model that accepts images, audio, video, and text in any combination and generates video as its primary output. It is Google's first generally available model designed around native video generation, with image and audio output modalities planned for future releases.
What's new
- Inputs: images, audio, video, and text — any combination
- Primary output: high-quality video generation
- Planned outputs: image and audio (not yet available at launch)
- Scene consistency: characters stay consistent across edits, physics hold across frames, scenes maintain continuity from previous context
- Physics reasoning: the model predicts what should happen next in a scene, drawing on Gemini's broader knowledge base
- Video editing via natural language: describe changes conversationally rather than through technical controls
- Digital avatar creation: generate videos with custom digital personas
- SynthID watermark: all outputs include an imperceptible verification watermark, embedded automatically
- Availability at launch: Google AI Plus, Pro, and Ultra subscribers (Gemini app and Google Flow); YouTube Shorts and YouTube Create app users at no cost
- Coming soon: developer and enterprise API access
Context
Gemini Omni enters a video generation market that includes Runway (Gen-3 Alpha), OpenAI's Sora, and Kling from Kuaishou. Google's differentiation is the depth of multimodal input support and the integration of Gemini's existing language and reasoning capabilities into the generation process. The physics reasoning claim — that the model "reasons about what should happen next" — targets a known weakness of current video generation tools, which frequently produce motion that violates basic physical expectations.
The SynthID watermark continues Google DeepMind's content authentication policy. All Google AI-generated media carries SynthID markers, and the policy applies to Gemini Omni from launch rather than being added retroactively.
The no-cost YouTube distribution gives Gemini Omni a path to scale that other video generation models have not had at launch. YouTube Shorts creators represent a large, immediately addressable audience for AI video tooling.
Why it matters
True any-input-to-video generation from a single model, mixing images, audio, text, and existing video, is a capability step beyond current video generation tools, which typically accept only text or a single image as input. The scene consistency claim, specifically that characters and physics persist across edits, addresses the most common production complaint about AI video: incoherent motion and character drift that make longer generated videos unusable without significant post-editing.
The YouTube distribution story has real implications for creator adoption: unlike Sora (API-first, paid) and Runway (subscription, professional-oriented), Gemini Omni reaches YouTube Shorts creators at no cost on day one. If API access follows quickly, Gemini Omni could establish itself as the default AI video infrastructure for teams already embedded in Google Cloud and YouTube's ecosystem.
Corroborating sources
- Blog
https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-omni/
“Gemini Omni is our new model that can create anything from any input — starting with video.”