Multimodal model for text, audio, vision, and video processing with real-time speech generation
Qwen2.5-Omni is an end-to-end multimodal large language model designed for comprehensive perception and generation across text, audio, vision, and video. It targets researchers and developers needing advanced multimodal capabilities, offering real-time streaming responses with both text and synthesized speech.
How It Works
The model employs a novel "Thinker-Talker" architecture, integrating multimodal perception with real-time speech generation. A key innovation is TMRoPE (Time-aligned Multimodal RoPE), a position embedding method that synchronizes timestamps across video and audio inputs. This allows for seamless processing of diverse data streams and enables natural, robust, and real-time voice and video interactions.
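To make the idea concrete, the sketch below (illustrative only, not the model's implementation) shows how time-aligned position IDs might be assigned: tokens are indexed by absolute timestamp, so audio tokens and video frames that occur at the same moment share the same temporal index. The step sizes and the helper name temporal_position_ids are assumptions made for the example.

# Illustrative sketch of time-aligned position IDs in the spirit of TMRoPE.
# The step sizes below are assumed example values, not Qwen2.5-Omni's actual config.
AUDIO_FRAME_SEC = 0.04   # assume one audio token per 40 ms
VIDEO_FPS = 2.0          # assume video sampled at 2 frames per second

def temporal_position_ids(num_audio_tokens: int, num_video_frames: int):
    """Place audio tokens and video frames on a shared, timestamp-based index."""
    # Audio tokens define the base time step, so their IDs are simply 0, 1, 2, ...
    audio_ids = list(range(num_audio_tokens))
    # Each video frame's timestamp is converted into the same step size.
    video_ids = [round((f / VIDEO_FPS) / AUDIO_FRAME_SEC) for f in range(num_video_frames)]
    return audio_ids, video_ids

audio_ids, video_ids = temporal_position_ids(num_audio_tokens=26, num_video_frames=2)
print(audio_ids[:5])  # [0, 1, 2, 3, 4]
print(video_ids)      # [0, 12] -> the frame at t=0.5 s aligns with the audio token near 0.5 s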
Quick Start & Requirements
pip install git+https://github.com/huggingface/transformers@v4.51.3-Qwen2.5-Omni-preview accelerate
pip install qwen-omni-utils[decord]
(requires ffmpeg)

Core dependencies: transformers, accelerate, soundfile, torch. FlashAttention-2 is recommended for performance.
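A minimal end-to-end call might look like the following sketch, adapted from the published Qwen2.5-Omni usage examples. The checkpoint id Qwen/Qwen2.5-Omni-7B, the sample video URL, the 24 kHz output sample rate, and the class name Qwen2_5OmniForConditionalGeneration (the pinned preview branch may instead expose Qwen2_5OmniModel) are assumptions to verify against the installed versions.

import soundfile as sf
from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor
from qwen_omni_utils import process_mm_info

MODEL_ID = "Qwen/Qwen2.5-Omni-7B"  # assumed checkpoint id

model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)
processor = Qwen2_5OmniProcessor.from_pretrained(MODEL_ID)

conversation = [
    # The model card's recommended system prompt for enabling speech output.
    {"role": "system", "content": [{"type": "text", "text":
        "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, "
        "capable of perceiving auditory and visual inputs, as well as generating text and speech."}]},
    {"role": "user", "content": [
        {"type": "video", "video": "https://example.com/clip.mp4"},  # hypothetical input video
        {"type": "text", "text": "Describe what is happening in this clip."},
    ]},
]

# Build the text prompt and extract the audio/image/video streams from the conversation.
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=True)
inputs = processor(
    text=text, audio=audios, images=images, videos=videos,
    return_tensors="pt", padding=True, use_audio_in_video=True,
).to(model.device)

# Generate the text reply and the synthesized speech waveform in one call.
text_ids, audio = model.generate(**inputs, use_audio_in_video=True, return_audio=True)
print(processor.batch_decode(text_ids, skip_special_tokens=True)[0])
sf.write("reply.wav", audio.reshape(-1).detach().cpu().numpy(), samplerate=24000)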
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
decord installation may require building from source on non-Linux systems.