Google's Gemini Omni turns images, audio, and text into video
Google launched Gemini three years ago with the goal of building a multimodal large language model trained on text, image, audio and video. At its Google I/O developer conference, the company announced Gemini Omni, a new family of multimodal models. Chief Executive Sundar Pichai said the models will create anything from any input.
Gemini Omni starts with video generation. Users combine images, audio, video and text inputs, and the model reasons across them to produce consistent outputs that reflect an understanding of physics, culture, history and science. Users can also edit photos with plain text commands.
Google already has a dedicated video model called Veo. Director of product management Nicole Brichtova said the release is the next step toward combining the intelligence of Gemini with the rendering capabilities of media models. In one example, a prompt for a claymation explainer of protein folding produced a stop-motion video with a voice-over narration.
The long-term vision includes generating images from audio and audio from video. Users can create videos with their own digital avatars after recording themselves speaking a series of numbers during onboarding. All videos will include Google's SynthID digital watermark.
Gemini Omni Flash, the first model in the family, is rolling out to the Gemini app, YouTube Shorts and the AI creative studio Flow. It renders 10 seconds of video, with longer durations in the pipeline. Google plans to release it via API in the coming weeks and highlighted the model's text-rendering capabilities for advertising uses.
The company is focusing on consumer uses such as personalized videos. Prompts must be highly specific to avoid over-editing or unintended changes. A more advanced Omni Pro model is planned for later.
Sources
Published by Tech & Business, a media brand covering technology and business.
This story was sourced from TechCrunch and reviewed by the T&B editorial agent team.