Qwen3-Omni by QwenLM

Natively end-to-end omni-modal LLM

Created 3 weeks ago


2,616 stars

Top 17.9% on SourcePulse

Project Summary

Qwen3-Omni is an end-to-end, omni-modal large language model developed by Alibaba Cloud's Qwen team. It processes text, audio, images, and video, and generates speech in real time, targeting researchers and developers who need advanced multimodal AI capabilities. Its key benefit is native, low-latency interaction across diverse data types with strong multilingual support.

How It Works

Qwen3-Omni employs an MoE-based Thinker–Talker architecture with AuT pretraining for robust general representations. This design enables native end-to-end processing of text, audio, images, and video, with real-time streaming responses in both text and natural speech. A multi-codebook speech-token scheme keeps generation latency low, while text-first pretraining and mixed multimodal training maintain strong performance across all modalities without regression.
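To make the division of labor concrete, the toy sketch below mimics the described data flow: the Thinker streams text tokens and hidden states, the Talker maps each hidden state to multi-codebook speech codes, and a codec decoder turns those codes into audio chunks emitted alongside the text. Every name here is hypothetical; this illustrates the pipeline shape, not the model's actual implementation.

```python
# Toy illustration of the Thinker-Talker streaming flow (all names hypothetical).
# Interleaving text decoding with multi-codebook speech decoding is what lets
# the model start speaking before the full text response is finished.

class Thinker:
    """Stands in for the MoE LLM: yields (text_token, hidden_state) pairs."""
    def stream(self, prompt):
        for token in prompt.split():
            yield token, [0.0] * 8  # dummy hidden state

class Talker:
    """Stands in for the speech model: hidden state -> multi-codebook codes."""
    def to_codes(self, hidden_state):
        return [0, 1, 2, 3]  # one code per codebook (dummy values)

class CodecDecoder:
    """Stands in for the codec that turns speech codes into a waveform chunk."""
    def decode(self, codes):
        return bytes(len(codes))  # dummy PCM chunk

def respond(prompt):
    thinker, talker, codec = Thinker(), Talker(), CodecDecoder()
    for text_token, hidden in thinker.stream(prompt):
        audio_chunk = codec.decode(talker.to_codes(hidden))
        # In a real deployment, both streams would be sent to the client here.
        print(text_token, f"({len(audio_chunk)} audio bytes)")

respond("The cat in the video is chasing a red laser dot")
```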

Quick Start & Requirements
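The upstream quick-start is not reproduced in this summary. The sketch below shows roughly how loading looks with Transformers built from source, using the precision and attention settings noted under Limitations & Caveats. The checkpoint ID, model class name, and preprocessing call are assumptions based on common Transformers patterns, not the repository's documented snippet; check the upstream README for the exact code, including audio output.

```python
# Sketch only: checkpoint ID, class name, and preprocessing call are assumptions.
# Prerequisite (no PyPI wheel yet):
#   pip install git+https://github.com/huggingface/transformers
import torch
from transformers import AutoProcessor, Qwen3OmniMoeForConditionalGeneration  # class name assumed

MODEL_ID = "Qwen/Qwen3-Omni-30B-A3B-Instruct"  # assumed checkpoint ID

model = Qwen3OmniMoeForConditionalGeneration.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,               # FlashAttention 2 needs fp16/bf16
    attn_implementation="flash_attention_2",  # optional; needs supported GPUs
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# A single text+image turn; the upstream examples use a helper for audio/video inputs.
conversation = [{"role": "user", "content": [
    {"type": "image", "image": "photo.jpg"},
    {"type": "text", "text": "What is in this picture?"},
]}]
inputs = processor.apply_chat_template(
    conversation, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

text_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(text_ids, skip_special_tokens=True)[0])
```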

Highlighted Details

  • Achieves open-source state-of-the-art (SOTA) on 32 of 36 audio/video benchmarks and overall SOTA on 22, with performance comparable to Gemini 2.5 Pro and GPT-4o on audio tasks.
  • Supports 119 text languages, 19 speech input languages, and 10 speech output languages.
  • Offers low-latency streaming interaction with natural turn-taking and immediate text or speech responses.
  • Includes Qwen3-Omni-30B-A3B-Captioner, an open-source audio captioning model for detailed, low-hallucination descriptions.

Maintenance & Community

Developed by the Qwen team at Alibaba Cloud. Community channels include WeChat and Discord.

Licensing & Compatibility

The provided README does not explicitly state the license type or any compatibility notes for commercial use.

Limitations & Caveats

  • Hugging Face Transformers must be built from source; the PyPI package is not yet released.
  • vLLM installation requires cloning a specific branch (qwen3_omni) and may involve building from source.
  • vLLM serve currently supports only the "Thinking" model, not the "Instruct" model with audio output.
  • The use_audio_in_video parameter must be set consistently across multi-round conversations for predictable results (see the sketch after this list).
  • FlashAttention 2 requires specific hardware and precision settings (torch.float16 or torch.bfloat16).
  • Inference requires significant GPU memory, and requirements grow substantially for longer video inputs.
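On the use_audio_in_video point, the practical rule is to choose the flag's value once and pass that same value to every preprocessing and generation call across all rounds. The sketch below only shows that flag-threading pattern; the processor and model call signatures are stand-ins rather than the repository's exact API.

```python
# Sketch of keeping `use_audio_in_video` consistent across conversation rounds.
# The processor/model call signatures below are stand-ins; only the pattern of
# threading a single flag value through every call is the point.
USE_AUDIO_IN_VIDEO = True  # decide once, reuse in every round

def run_round(processor, model, conversation):
    inputs = processor.apply_chat_template(
        conversation,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
        use_audio_in_video=USE_AUDIO_IN_VIDEO,  # preprocessing side
    )
    # Generation side must receive the same value as preprocessing did.
    return model.generate(**inputs, use_audio_in_video=USE_AUDIO_IN_VIDEO)
```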

Health Check

  • Last Commit: 6 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 3
  • Issues (30d): 11
  • Star History: 2,632 stars in the last 24 days

Explore Similar Projects

Starred by Thomas Wolf (Cofounder of Hugging Face), Chip Huyen (Author of "AI Engineering" and "Designing Machine Learning Systems"), and 5 more.

ultravox by fixie-ai

0.2% · 4k stars
Multimodal LLM for real-time voice interactions
Created 1 year ago · Updated 1 month ago