Qwen3-Omni by QwenLM

Natively end-to-end omni-modal LLM

Created 3 weeks ago


2,616 stars

Top 17.9% on SourcePulse

Project Summary

Qwen3-Omni is an end-to-end, omni-modal large language model developed by Alibaba Cloud's Qwen team. It processes text, audio, images, and video, and generates speech in real time, targeting researchers and developers who need advanced multimodal AI capabilities. Its key benefit is native, low-latency interaction across diverse data types with strong multilingual support.

How It Works

Qwen3-Omni employs an MoE-based Thinker–Talker architecture with AuT pretraining for robust general representations. This design enables native end-to-end processing of text, audio, images, and video, with real-time streaming responses in both text and natural speech. A multi-codebook speech-token scheme keeps generation latency low, while text-first pretraining and mixed multimodal training maintain strong performance across all modalities without regression.
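To make the division of labor concrete, the toy sketch below mimics the described data flow: the Thinker streams text tokens and hidden states, the Talker maps each hidden state to multi-codebook speech codes, and a codec decoder turns those codes into audio chunks emitted alongside the text. Every name here is hypothetical; this illustrates the pipeline shape, not the model's actual implementation.

```python
# Toy illustration of the Thinker-Talker streaming flow (all names hypothetical).
# Interleaving text decoding with multi-codebook speech decoding is what lets
# the model start speaking before the full text response is finished.

class Thinker:
    """Stands in for the MoE LLM: yields (text_token, hidden_state) pairs."""
    def stream(self, prompt):
        for token in prompt.split():
            yield token, [0.0] * 8  # dummy hidden state

class Talker:
    """Stands in for the speech model: hidden state -> multi-codebook codes."""
    def to_codes(self, hidden_state):
        return [0, 1, 2, 3]  # one code per codebook (dummy values)

class CodecDecoder:
    """Stands in for the codec that turns speech codes into a waveform chunk."""
    def decode(self, codes):
        return bytes(len(codes))  # dummy PCM chunk

def respond(prompt):
    thinker, talker, codec = Thinker(), Talker(), CodecDecoder()
    for text_token, hidden in thinker.stream(prompt):
        audio_chunk = codec.decode(talker.to_codes(hidden))
        # In a real deployment, both streams would be sent to the client here.
        print(text_token, f"({len(audio_chunk)} audio bytes)")

respond("The cat in the video is chasing a red laser dot")
```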

Quick Start & Requirements
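The upstream quick-start is not reproduced in this summary. The sketch below shows roughly how loading looks with Transformers built from source, using the precision and attention settings noted under Limitations & Caveats. The checkpoint ID, model class name, and preprocessing call are assumptions based on common Transformers patterns, not the repository's documented snippet; check the upstream README for the exact code, including audio output.

```python
# Sketch only: checkpoint ID, class name, and preprocessing call are assumptions.
# Prerequisite (no PyPI wheel yet):
#   pip install git+https://github.com/huggingface/transformers
import torch
from transformers import AutoProcessor, Qwen3OmniMoeForConditionalGeneration  # class name assumed

MODEL_ID = "Qwen/Qwen3-Omni-30B-A3B-Instruct"  # assumed checkpoint ID

model = Qwen3OmniMoeForConditionalGeneration.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,               # FlashAttention 2 needs fp16/bf16
    attn_implementation="flash_attention_2",  # optional; needs supported GPUs
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# A single text+image turn; the upstream examples use a helper for audio/video inputs.
conversation = [{"role": "user", "content": [
    {"type": "image", "image": "photo.jpg"},
    {"type": "text", "text": "What is in this picture?"},
]}]
inputs = processor.apply_chat_template(
    conversation, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

text_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(text_ids, skip_special_tokens=True)[0])
```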

Highlighted Details

  • Achieves open-source state-of-the-art (SOTA) on 32 of 36 audio/video benchmarks and overall SOTA on 22, with performance comparable to Gemini 2.5 Pro and GPT-4o on audio tasks.
  • Supports 119 text languages, 19 speech input languages, and 10 speech output languages.
  • Offers low-latency streaming interaction with natural turn-taking and immediate text or speech responses.
  • Includes Qwen3-Omni-30B-A3B-Captioner, an open-source audio captioning model for detailed, low-hallucination descriptions.

Maintenance & Community

Developed by the Qwen team at Alibaba Cloud. Community channels include WeChat and Discord.

Licensing & Compatibility

The provided README does not explicitly state the license type or any compatibility notes for commercial use.

Limitations & Caveats

  • Hugging Face Transformers must be built from source; the PyPI package is not yet released.
  • vLLM installation requires cloning a specific branch (qwen3_omni) and may involve building from source.
  • vLLM serve currently supports only the "Thinking" model, not the "Instruct" model with audio output.
  • The use_audio_in_video parameter must be set consistently across multi-round conversations for predictable results (see the sketch after this list).
  • FlashAttention 2 requires specific hardware and precision settings (torch.float16 or torch.bfloat16).
  • Inference requires significant GPU memory, and requirements grow substantially for longer video inputs.
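On the use_audio_in_video point, the practical rule is to choose the flag's value once and pass that same value to every preprocessing and generation call across all rounds. The sketch below only shows that flag-threading pattern; the processor and model call signatures are stand-ins rather than the repository's exact API.

```python
# Sketch of keeping `use_audio_in_video` consistent across conversation rounds.
# The processor/model call signatures below are stand-ins; only the pattern of
# threading a single flag value through every call is the point.
USE_AUDIO_IN_VIDEO = True  # decide once, reuse in every round

def run_round(processor, model, conversation):
    inputs = processor.apply_chat_template(
        conversation,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
        use_audio_in_video=USE_AUDIO_IN_VIDEO,  # preprocessing side
    )
    # Generation side must receive the same value as preprocessing did.
    return model.generate(**inputs, use_audio_in_video=USE_AUDIO_IN_VIDEO)
```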

Health Check

  • Last Commit: 6 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 3
  • Issues (30d): 11
  • Star History: 2,632 stars in the last 24 days

Explore Similar Projects

Starred by Thomas Wolf (Cofounder of Hugging Face), Chip Huyen (Author of "AI Engineering" and "Designing Machine Learning Systems"), and 5 more.

ultravox by fixie-ai

0.2% · 4k stars
Multimodal LLM for real-time voice interactions
Created 1 year ago · Updated 1 month ago