Natively end-to-end omni-modal LLM
Top 17.9% on SourcePulse
Qwen3-Omni is an end-to-end, omni-modal large language model developed by Alibaba Cloud's Qwen team. It processes text, audio, images, and video, and generates speech in real time, targeting researchers and developers who need advanced multimodal AI capabilities. Its key benefit is native, low-latency interaction across diverse data types with strong multilingual support.
How It Works
Qwen3-Omni employs a novel MoE-based Thinker–Talker architecture with AuT pretraining for robust general representations. This design enables native end-to-end processing of text, audio, images, and video, facilitating real-time streaming responses in both text and natural speech. The multi-codebook approach minimizes latency, while its text-first pretraining and mixed multimodal training ensure strong performance across all modalities without regression.
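To make the streaming idea concrete, here is a deliberately toy sketch of the Thinker–Talker dataflow. Every name in it is invented for illustration; the real Thinker and Talker are MoE transformer components inside one end-to-end network, not separate Python functions.

```python
# Toy illustration only: invented names, not the actual implementation.
from typing import Iterator, List


def thinker(prompt_tokens: List[str]) -> Iterator[str]:
    """'Thinker': consumes fused text/audio/image/video features and streams
    text tokens autoregressively (stubbed here by echoing the prompt)."""
    for token in prompt_tokens:
        yield token


def talker(text_stream: Iterator[str], codebooks: int = 4) -> Iterator[List[int]]:
    """'Talker': conditions on the Thinker's states and emits one multi-codebook
    audio-codec frame per step, so speech synthesis can begin before the full
    text reply is finished -- this is where the latency saving comes from."""
    for token in text_stream:
        yield [hash((token, k)) % 1024 for k in range(codebooks)]


def stream_reply(prompt_tokens: List[str]) -> None:
    # Text and codec frames are produced incrementally; a real deployment would
    # decode each codec frame to a waveform chunk as soon as it arrives.
    for frame in talker(thinker(prompt_tokens)):
        print("codec frame:", frame)


stream_reply("hello how can I help you today".split())
```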
Quick Start & Requirements
Install Hugging Face Transformers from source with `pip install git+https://github.com/huggingface/transformers`, then `pip install accelerate qwen-omni-utils`. For vLLM, clone the `qwen3_omni` branch from its repository and install from source; Docker images are also provided. `ffmpeg` is required by `qwen-omni-utils`. FlashAttention 2 is recommended for reduced GPU memory usage and requires compatible hardware and `torch.float16` or `torch.bfloat16` precision.
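A minimal inference sketch follows. The checkpoint ID, the `Qwen3OmniMoeForConditionalGeneration`/`Qwen3OmniMoeProcessor` class names, and the `process_mm_info` helper are assumptions based on the Qwen omni model family; consult the official README or model card for the exact names and for how to retrieve generated audio.

```python
# Hedged sketch, not verified against the repo: class names, the checkpoint ID,
# and the qwen-omni-utils helper are assumptions; audio output handling is omitted.
import torch
from transformers import (
    Qwen3OmniMoeForConditionalGeneration,  # assumed class name
    Qwen3OmniMoeProcessor,                 # assumed class name
)
from qwen_omni_utils import process_mm_info  # assumed helper from qwen-omni-utils

MODEL_ID = "Qwen/Qwen3-Omni-30B-A3B-Instruct"  # assumed checkpoint name

model = Qwen3OmniMoeForConditionalGeneration.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,               # FlashAttention 2 needs fp16/bf16
    device_map="auto",
    attn_implementation="flash_attention_2",  # optional; needs compatible GPUs
)
processor = Qwen3OmniMoeProcessor.from_pretrained(MODEL_ID)

conversation = [
    {"role": "user", "content": [
        {"type": "video", "video": "demo.mp4"},
        {"type": "text", "text": "Describe what happens in this clip."},
    ]},
]

# Build the text prompt and extract audio/image/video inputs from the conversation.
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=True)
inputs = processor(
    text=text, audio=audios, images=images, videos=videos,
    return_tensors="pt", padding=True, use_audio_in_video=True,
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=256, use_audio_in_video=True)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```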
Maintenance & Community
Developed by the Qwen team at Alibaba Cloud. Community channels include WeChat and Discord.
Licensing & Compatibility
The provided README does not explicitly state the license type or any compatibility notes for commercial use.
Limitations & Caveats
Hugging Face Transformers must be installed from source, since the released PyPI package does not yet include Qwen3-Omni support. vLLM installation requires cloning the `qwen3_omni` branch and may involve building from source, and vLLM serve currently supports only the Thinking model, not the Instruct model with audio output. The `use_audio_in_video` parameter must be kept consistent across multi-round conversations for predictable results (see the sketch below). FlashAttention 2 requires specific hardware and `torch.float16` or `torch.bfloat16` precision. Inference requires significant GPU memory, and requirements increase substantially with longer video inputs.
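Continuing with the assumed names from the Quick Start sketch above, the `use_audio_in_video` caveat amounts to pinning the flag once and passing the same value at every step of every round:

```python
# Sketch only: reuse one value for use_audio_in_video across preprocessing,
# tokenization, and generation in every round of a multi-turn conversation.
USE_AUDIO_IN_VIDEO = True  # choose once per conversation and never change it

audios, images, videos = process_mm_info(conversation, use_audio_in_video=USE_AUDIO_IN_VIDEO)
inputs = processor(
    text=text, audio=audios, images=images, videos=videos,
    return_tensors="pt", padding=True, use_audio_in_video=USE_AUDIO_IN_VIDEO,
).to(model.device)
output_ids = model.generate(**inputs, use_audio_in_video=USE_AUDIO_IN_VIDEO)
```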