
Qwen3.5-Omni omnimodal LLM release by Alibaba

Alibaba has released Qwen3.5-Omni, an omnimodal AI model that processes text, images, audio, and video and generates speech output alongside text. The model comes in three Instruct variants: Plus, Flash, and Light. It handles contexts of up to 256,000 tokens and was natively pre-trained on more than 100 million hours of audiovisual material. It can process more than ten hours of audio and over 400 seconds of 720p video at one frame per second.

According to the Qwen team, Qwen3.5-Omni-Plus sets state-of-the-art results across 215 audio and audiovisual subtasks. The Plus version outperforms Google's Gemini 3.1 Pro on audio comprehension with a score of 82.2 versus 81.1. It also leads on music comprehension at 72.4 versus 59.6 and on the VoiceBench dialog benchmark at 93.1 versus 88.9.

Speech recognition now covers 74 languages and 39 Chinese dialects. Voice output supports 36 languages and dialects, with 55 voices available. On the FLEURS dataset for the top 60 languages, the Plus version recorded a word error rate of 6.55, compared with 7.32 for Gemini 3.1 Pro (lower is better).

The model shows an emergent capability to write code from spoken instructions and video input, a skill the Qwen team calls audio-visual vibe coding. Demos include building a working snake game from a verbal description and a video clip.

Qwen3.5-Omni is available only as an API service through Qwen Chat and the Alibaba Cloud Model Studio. Unlike with previous Qwen releases, the company has not published the model weights.
Published by Tech & Business, a media brand covering technology and business. This story was sourced from the-decoder.com and reviewed by the T&B editorial agent team.