QwenLM/Qwen3-Omni: Natively end-to-end omni-modal LLM
Top 15.9% on SourcePulse
Qwen3-Omni is an end-to-end, omni-modal Large Language Model developed by Alibaba Cloud's Qwen team. It processes text, audio, images, and video, and generates speech in real time, and it is aimed at researchers and developers who need advanced multimodal AI capabilities. Its key benefit is native, low-latency interaction across diverse data types with strong multilingual support.
How It Works
Qwen3-Omni employs a novel MoE-based Thinker–Talker architecture with AuT pretraining for robust general representations. This design enables native end-to-end processing of text, audio, images, and video, facilitating real-time streaming responses in both text and natural speech. The multi-codebook approach minimizes latency, while its text-first pretraining and mixed multimodal training ensure strong performance across all modalities without regression.
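As a rough mental model of the Thinker–Talker split (a conceptual illustration only, not the project's actual code), the sketch below shows a "thinker" streaming text tokens while a "talker" consumes them incrementally and emits multi-codebook audio frames, which is what keeps perceived latency low.

```python
# Conceptual illustration only, not Qwen3-Omni's implementation: the Thinker
# streams text tokens and the Talker converts them to audio codec frames
# incrementally, so speech can start before the full reply has been generated.
from typing import Iterator, List

def thinker(prompt: str) -> Iterator[str]:
    """Stands in for the MoE 'Thinker': yields text tokens one at a time."""
    for token in ("Sure,", " here", " is", " the", " answer", "."):
        yield token

def talker(text_tokens: Iterator[str]) -> Iterator[List[int]]:
    """Stands in for the 'Talker': maps each incoming text token to
    multi-codebook codec IDs as soon as the token arrives."""
    for token in text_tokens:
        # A real talker predicts acoustic codebook IDs; here we fake one frame.
        yield [hash(token) % 1024, hash(token) % 512]

def stream_reply(prompt: str) -> None:
    for frame in talker(thinker(prompt)):
        # Each frame could be decoded to a short audio chunk and played back
        # immediately, rather than waiting for the complete text response.
        print("audio frame:", frame)

stream_reply("Describe the picture.")
```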
Quick Start & Requirements
Install Transformers from source (pip install git+https://github.com/huggingface/transformers) along with accelerate and qwen-omni-utils (pip install accelerate qwen-omni-utils). For vLLM, clone the dedicated qwen3_omni branch from its repository and install from source; Docker images are also provided. ffmpeg is required by qwen-omni-utils. FlashAttention 2 is recommended to reduce GPU memory usage; it requires compatible hardware and the model loaded in torch.float16 or torch.bfloat16.
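A minimal inference sketch along the Transformers path is shown below. The model ID, the class names (Qwen3OmniMoeForConditionalGeneration, Qwen3OmniMoeProcessor), the process_mm_info helper, and the example image URL are assumptions based on the project's published usage pattern; check the repository README for the exact API.

```python
# Sketch only: model ID, class names, and helper names are assumptions modeled
# on the Qwen omni-model usage pattern; verify them against the README.
import torch
from transformers import Qwen3OmniMoeForConditionalGeneration, Qwen3OmniMoeProcessor
from qwen_omni_utils import process_mm_info  # requires ffmpeg on the system

MODEL_ID = "Qwen/Qwen3-Omni-30B-A3B-Instruct"  # assumed checkpoint name

model = Qwen3OmniMoeForConditionalGeneration.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,              # FlashAttention 2 needs fp16/bf16
    attn_implementation="flash_attention_2",
    device_map="auto",
)
processor = Qwen3OmniMoeProcessor.from_pretrained(MODEL_ID)

conversation = [
    {"role": "user", "content": [
        {"type": "image", "image": "https://example.com/cat.png"},  # placeholder input
        {"type": "text", "text": "Describe this image in one sentence."},
    ]}
]

use_audio_in_video = True  # keep this flag consistent across turns
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=use_audio_in_video)
inputs = processor(
    text=text, audio=audios, images=images, videos=videos,
    return_tensors="pt", padding=True, use_audio_in_video=use_audio_in_video,
).to(model.device)

result = model.generate(**inputs, max_new_tokens=256, use_audio_in_video=use_audio_in_video)
# With speech output enabled, generate() may return a (text_ids, audio) pair;
# this sketch only decodes the text side.
text_ids = result[0] if isinstance(result, tuple) else result
print(processor.batch_decode(text_ids[:, inputs["input_ids"].shape[1]:],
                             skip_special_tokens=True)[0])
```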
Maintenance & Community
Developed by the Qwen team at Alibaba Cloud. Community channels include WeChat and Discord.
Licensing & Compatibility
The provided README does not explicitly state the license type or any compatibility notes for commercial use.
Limitations & Caveats
- Hugging Face Transformers must be installed from source, as the PyPI package with this model is not yet released.
- vLLM installation requires cloning the dedicated qwen3_omni branch and may involve building from source.
- vLLM serve currently supports only the "Thinking" model, not the "Instruct" model with audio output.
- The use_audio_in_video parameter must be set consistently across the rounds of a multi-turn conversation for predictable results (see the sketch after this list).
- FlashAttention 2 requires specific hardware and precision settings (torch.float16 or torch.bfloat16).
- Inference needs significant GPU memory, with requirements increasing substantially for longer video inputs.
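As a concrete illustration of the use_audio_in_video caveat, here is a minimal sketch (reusing the assumed processor and helper names from the Quick Start sketch above): the flag is fixed once and passed to the preprocessing helper, the processor, and generate() in every round.

```python
# Sketch of the use_audio_in_video caveat: choose one value and pass it
# everywhere, in every round. Processor/helper names are assumed as above.
from qwen_omni_utils import process_mm_info

USE_AUDIO_IN_VIDEO = True  # fixed at module level so no round can diverge

def run_round(processor, model, conversation):
    text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
    audios, images, videos = process_mm_info(conversation, use_audio_in_video=USE_AUDIO_IN_VIDEO)
    inputs = processor(text=text, audio=audios, images=images, videos=videos,
                       return_tensors="pt", padding=True,
                       use_audio_in_video=USE_AUDIO_IN_VIDEO).to(model.device)
    # Mixing use_audio_in_video=True in one round and False in the next can
    # produce unpredictable results, hence the single shared constant.
    return model.generate(**inputs, use_audio_in_video=USE_AUDIO_IN_VIDEO)
```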