Multimodal model for text, audio, vision, and video processing with real-time speech generation
Qwen2.5-Omni is an end-to-end multimodal large language model designed for comprehensive perception and generation across text, audio, vision, and video. It targets researchers and developers needing advanced multimodal capabilities, offering real-time streaming responses with both text and synthesized speech.
How It Works
The model employs a novel "Thinker-Talker" architecture, integrating multimodal perception with real-time speech generation. A key innovation is TMRoPE (Time-aligned Multimodal RoPE), a position embedding method that synchronizes timestamps across video and audio inputs. This allows for seamless processing of diverse data streams and enables natural, robust, and real-time voice and video interactions.
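To make the idea concrete, the sketch below (illustrative only, not the model's implementation) shows how time-aligned position IDs might be assigned: tokens are indexed by absolute timestamp, so audio tokens and video frames that occur at the same moment share the same temporal index. The step sizes and the helper name temporal_position_ids are assumptions made for the example.

# Illustrative sketch of time-aligned position IDs in the spirit of TMRoPE.
# The step sizes below are assumed example values, not Qwen2.5-Omni's actual config.
AUDIO_FRAME_SEC = 0.04   # assume one audio token per 40 ms
VIDEO_FPS = 2.0          # assume video sampled at 2 frames per second

def temporal_position_ids(num_audio_tokens: int, num_video_frames: int):
    """Place audio tokens and video frames on a shared, timestamp-based index."""
    # Audio tokens define the base time step, so their IDs are simply 0, 1, 2, ...
    audio_ids = list(range(num_audio_tokens))
    # Each video frame's timestamp is converted into the same step size.
    video_ids = [round((f / VIDEO_FPS) / AUDIO_FRAME_SEC) for f in range(num_video_frames)]
    return audio_ids, video_ids

audio_ids, video_ids = temporal_position_ids(num_audio_tokens=26, num_video_frames=2)
print(audio_ids[:5])  # [0, 1, 2, 3, 4]
print(video_ids)      # [0, 12] -> the frame at t=0.5 s aligns with the audio token near 0.5 s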
Quick Start & Requirements
pip install git+https://github.com/huggingface/transformers@v4.51.3-Qwen2.5-Omni-preview accelerate
pip install qwen-omni-utils[decord]
(requires ffmpeg)

Core dependencies: transformers, accelerate, soundfile, torch. FlashAttention-2 is recommended for performance.
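A minimal end-to-end call might look like the following sketch, adapted from the published Qwen2.5-Omni usage examples. The checkpoint id Qwen/Qwen2.5-Omni-7B, the sample video URL, the 24 kHz output sample rate, and the class name Qwen2_5OmniForConditionalGeneration (the pinned preview branch may instead expose Qwen2_5OmniModel) are assumptions to verify against the installed versions.

import soundfile as sf
from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor
from qwen_omni_utils import process_mm_info

MODEL_ID = "Qwen/Qwen2.5-Omni-7B"  # assumed checkpoint id

model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)
processor = Qwen2_5OmniProcessor.from_pretrained(MODEL_ID)

conversation = [
    # The model card's recommended system prompt for enabling speech output.
    {"role": "system", "content": [{"type": "text", "text":
        "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, "
        "capable of perceiving auditory and visual inputs, as well as generating text and speech."}]},
    {"role": "user", "content": [
        {"type": "video", "video": "https://example.com/clip.mp4"},  # hypothetical input video
        {"type": "text", "text": "Describe what is happening in this clip."},
    ]},
]

# Build the text prompt and extract the audio/image/video streams from the conversation.
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=True)
inputs = processor(
    text=text, audio=audios, images=images, videos=videos,
    return_tensors="pt", padding=True, use_audio_in_video=True,
).to(model.device)

# Generate the text reply and the synthesized speech waveform in one call.
text_ids, audio = model.generate(**inputs, use_audio_in_video=True, return_audio=True)
print(processor.batch_decode(text_ids, skip_special_tokens=True)[0])
sf.write("reply.wav", audio.reshape(-1).detach().cpu().numpy(), samplerate=24000)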
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
decord installation may require building from source on non-Linux systems.