Qwen2.5-Omni by QwenLM

Multimodal model for text, audio, vision, and video processing with real-time speech generation

Created 6 months ago
3,623 stars

Top 13.4% on SourcePulse

Project Summary

Qwen2.5-Omni is an end-to-end multimodal large language model designed for comprehensive perception and generation across text, audio, vision, and video. It targets researchers and developers needing advanced multimodal capabilities, offering real-time streaming responses with both text and synthesized speech.

How It Works

The model employs a novel "Thinker-Talker" architecture that integrates multimodal perception with real-time speech generation. A key innovation is TMRoPE (Time-aligned Multimodal RoPE), a position embedding method that places video and audio inputs on a shared timeline by synchronizing their timestamps. This lets the model process interleaved data streams coherently and supports natural, robust, real-time voice and video interaction.
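The time-alignment idea behind TMRoPE can be shown with a toy sketch: map every audio and video token onto a shared integer time grid derived from its timestamp, so tokens that co-occur in time share a temporal position. The Token class, the 40 ms granularity, and the function below are illustrative assumptions, not the model's actual implementation.

```python
# Toy sketch of time-aligned positions: audio and video tokens share one time axis.
# Names and the 40 ms grid are illustrative assumptions, not Qwen2.5-Omni internals.
from dataclasses import dataclass

@dataclass
class Token:
    modality: str      # "audio" or "video"
    timestamp: float   # seconds from the start of the clip
    payload: object = None

def time_aligned_positions(tokens, granularity=0.04):
    """Map each token's timestamp onto a shared integer time grid
    (granularity is a hypothetical 40 ms step)."""
    ordered = sorted(tokens, key=lambda t: t.timestamp)
    return [(tok, round(tok.timestamp / granularity)) for tok in ordered]

# Video frames every 0.5 s and audio frames every 40 ms land on the same grid,
# so a video frame at 1.0 s and the audio heard at 1.0 s get the same temporal id.
clip = [Token("video", i * 0.5) for i in range(4)] + [Token("audio", i * 0.04) for i in range(50)]
for tok, pos in time_aligned_positions(clip)[:5]:
    print(tok.modality, tok.timestamp, pos)
```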

Quick Start & Requirements

  • Transformers Usage: pip install git+https://github.com/huggingface/transformers@v4.51.3-Qwen2.5-Omni-preview accelerate
  • Utilities: pip install qwen-omni-utils[decord] (requires ffmpeg)
  • Dependencies: Python, transformers, accelerate, soundfile, torch. FlashAttention-2 is recommended for performance.
  • GPU Memory: BF16 inference with the 7B model requires ~31 GB for a 15-second video, ~42 GB for 30 seconds, and ~60 GB for 60 seconds.
  • Resources: Official Docker image available. Demo and cookbooks provided.
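A minimal text-and-speech inference sketch, assuming the preview Transformers branch installed above; the Qwen2_5OmniForConditionalGeneration / Qwen2_5OmniProcessor class names and the process_mm_info helper follow the upstream model card and may differ between preview builds.

```python
# Minimal inference sketch for Qwen2.5-Omni-7B (assumptions noted above).
import soundfile as sf
from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor
from qwen_omni_utils import process_mm_info

model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-Omni-7B", torch_dtype="auto", device_map="auto"
)
processor = Qwen2_5OmniProcessor.from_pretrained("Qwen/Qwen2.5-Omni-7B")

# Single-turn conversation with one local video; replace the path with your own clip.
conversation = [
    {"role": "user", "content": [
        {"type": "video", "video": "example.mp4"},
        {"type": "text", "text": "Describe what happens in this video."},
    ]},
]

text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=True)
inputs = processor(text=text, audio=audios, images=images, videos=videos,
                   return_tensors="pt", padding=True, use_audio_in_video=True)
inputs = inputs.to(model.device).to(model.dtype)

# generate() returns token ids plus a synthesized-speech waveform.
text_ids, audio = model.generate(**inputs, use_audio_in_video=True)
print(processor.batch_decode(text_ids, skip_special_tokens=True)[0])
sf.write("output.wav", audio.reshape(-1).detach().cpu().float().numpy(), samplerate=24000)
```

Note that the upstream README and cookbooks recommend a specific system prompt to enable speech output; consult them for the exact wording.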

Highlighted Details

  • Achieves state-of-the-art performance on multimodal benchmarks like OmniBench.
  • Outperforms similarly sized models in audio, image, and video understanding tasks.
  • Supports real-time voice and video chat with streaming audio output.
  • Offers two distinct voice types (Chelsie, Ethan) for speech generation.
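Reusing model and inputs from the Quick Start sketch above, the voice is reportedly selected at generation time; the speaker keyword below follows the upstream model card and may change between releases.

```python
# Hedged sketch: choose the synthesized voice when generating.
# "Chelsie" is described as the default voice; "Ethan" is the alternative.
# The keyword name ("speaker") is taken from the upstream model card and may differ.
text_ids, audio = model.generate(**inputs, speaker="Ethan", use_audio_in_video=True)
```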

Maintenance & Community

  • Developed by Alibaba Cloud's Qwen team.
  • Active development with frequent updates.
  • Community channels include Hugging Face, ModelScope, Discord, and WeChat.
  • Cookbooks available for various use cases.

Licensing & Compatibility

  • The README does not explicitly state a license; usage examples and community links suggest open availability, but commercial-use terms should be verified against the upstream repository.

Limitations & Caveats

  • High GPU memory requirements for inference, especially with longer video inputs (a memory-saving sketch follows this list).
  • decord installation may require building from source on non-Linux systems.
  • vLLM deployment for Qwen2.5-Omni currently only supports text output ("thinker" mode).
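To ease the GPU-memory pressure noted above, a hedged sketch using the talker-disabling and text-only options described in the upstream usage tips; the disable_talker() helper and return_audio flag are taken from that documentation and should be verified against the installed version.

```python
# Skip the speech "talker" entirely to save GPU memory when only text output is needed.
# (disable_talker() is assumed from the upstream usage tips.)
model.disable_talker()

# Alternatively, keep the talker loaded but request text-only output for a single call.
text_ids = model.generate(**inputs, return_audio=False, use_audio_in_video=True)
```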
Health Check

  • Last commit: 3 months ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 9
  • Star history: 124 stars in the last 30 days
