Qwen2.5-Omni by QwenLM

Multimodal model for text, audio, vision, and video processing with real-time speech generation

created 4 months ago
3,414 stars

Top 14.5% on sourcepulse

View on GitHub
Project Summary

Qwen2.5-Omni is an end-to-end multimodal large language model designed for comprehensive perception and generation across text, audio, vision, and video. It targets researchers and developers needing advanced multimodal capabilities, offering real-time streaming responses with both text and synthesized speech.

How It Works

The model employs a novel "Thinker-Talker" architecture that couples multimodal perception with real-time speech generation: the Thinker handles understanding across modalities and produces text, while the Talker consumes the Thinker's representations to stream synthesized speech. A key innovation is TMRoPE (Time-aligned Multimodal RoPE), a position-embedding method that aligns audio and video inputs on a shared timeline, allowing the model to process interleaved data streams and supporting natural, robust, real-time voice and video interaction.
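
A purely conceptual sketch of the idea (not the model's actual implementation): tokens from different modalities are assigned temporal position indices derived from their timestamps, so audio and video tokens that occur at the same moment share the same index. The bucket size and data layout below are illustrative assumptions.

```python
# Conceptual illustration of time-aligned temporal position IDs (not the real TMRoPE code).
from dataclasses import dataclass

@dataclass
class MMToken:
    modality: str      # "audio" or "video"
    timestamp: float   # seconds from the start of the clip
    payload: str       # stand-in for the actual token embedding

def temporal_position_ids(tokens: list[MMToken], step: float = 0.04) -> list[int]:
    """Bucket each token's timestamp into fixed steps so co-occurring tokens share an index."""
    return [int(tok.timestamp / step) for tok in tokens]

# Audio and video are sampled at different rates but line up on the same time axis.
stream = [
    MMToken("video", 0.00, "frame_0"), MMToken("audio", 0.00, "a_0"),
    MMToken("audio", 0.04, "a_1"),     MMToken("video", 0.08, "frame_1"),
    MMToken("audio", 0.08, "a_2"),
]
print(temporal_position_ids(stream))  # -> [0, 0, 1, 2, 2]
```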

Quick Start & Requirements

  • Transformers Usage: pip install git+https://github.com/huggingface/transformers@v4.51.3-Qwen2.5-Omni-preview accelerate (see the inference sketch after this list)
  • Utilities: pip install qwen-omni-utils[decord] (requires ffmpeg)
  • Dependencies: Python, transformers, accelerate, soundfile, torch. FlashAttention-2 is recommended for performance.
  • GPU Memory: BF16 inference for 7B model requires ~31GB (15s video), ~42GB (30s video), ~60GB (60s video).
  • Resources: Official Docker image available. Demo and cookbooks provided.
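
For orientation, here is a minimal sketch of the Transformers inference path, adapted from the repository's quick start. The class names (Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor), the process_mm_info helper, the processor keyword names, and the 24 kHz output sample rate follow the preview branch and cookbooks and should be checked against the version you install; the video URL is a placeholder.

```python
# Minimal sketch of multimodal inference with the Transformers preview branch.
# Class and helper names follow the Qwen2.5-Omni quick start; verify against your install.
import soundfile as sf
from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor
from qwen_omni_utils import process_mm_info

model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-Omni-7B", torch_dtype="auto", device_map="auto"
)
processor = Qwen2_5OmniProcessor.from_pretrained("Qwen/Qwen2.5-Omni-7B")

conversation = [
    {"role": "user", "content": [
        {"type": "video", "video": "https://example.com/clip.mp4"},  # placeholder URL
        {"type": "text", "text": "What is happening in this clip?"},
    ]},
]

# Build the chat prompt and extract audio/image/video inputs from the conversation.
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=True)
inputs = processor(
    text=text, audio=audios, images=images, videos=videos,
    return_tensors="pt", padding=True, use_audio_in_video=True,
).to(model.device)

# The model returns both text tokens (Thinker) and a waveform (Talker).
text_ids, audio = model.generate(**inputs, use_audio_in_video=True)
print(processor.batch_decode(text_ids, skip_special_tokens=True)[0])
sf.write("output.wav", audio.reshape(-1).detach().cpu().numpy(), samplerate=24000)
```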

Highlighted Details

  • Achieves state-of-the-art performance on multimodal benchmarks like OmniBench.
  • Outperforms similarly sized models in audio, image, and video understanding tasks.
  • Supports real-time voice and video chat with streaming audio output.
  • Offers two distinct voice types (Chelsie, Ethan) for speech generation (see the snippet below).
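
Continuing the sketch above, voice selection is reportedly a single keyword argument to generate; the exact name (speaker is assumed here) should be confirmed against the installed preview version.

```python
# Assumed keyword for voice selection ("speaker"); confirm the name in your installed version.
text_ids, audio = model.generate(**inputs, speaker="Ethan", use_audio_in_video=True)
```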

Maintenance & Community

  • Developed by Alibaba Cloud's Qwen team.
  • Active development with frequent updates.
  • Community channels include Hugging Face, ModelScope, Discord, and WeChat.
  • Cookbooks available for various use cases.

Licensing & Compatibility

  • The README does not explicitly state a license; usage examples and community links suggest open availability, but commercial-use terms should be verified against the repository and the published model cards.

Limitations & Caveats

  • High GPU memory requirements for inference, especially with longer video inputs.
  • decord installation may require building from source on non-Linux systems.
  • vLLM deployment for Qwen2.5-Omni currently supports text output only ("thinker" mode); see the sketch below.
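
For the vLLM path, a thinker-only (text output) call might look like the following. This assumes a vLLM build with Qwen2.5-Omni support installed as the repository's deployment notes describe; the flags shown are illustrative, and no speech is produced this way.

```python
# Hypothetical thinker-only (text output) inference via vLLM's offline API.
# Assumes a vLLM build with Qwen2.5-Omni support, installed per the repo's deployment notes.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-Omni-7B", dtype="bfloat16")
params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Describe what an omni-modal assistant can do, in one sentence."], params)
print(outputs[0].outputs[0].text)  # text only; the Talker (speech) path is not served here
```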
Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 21

Star History

  • 627 stars in the last 90 days

Explore Similar Projects

Starred by Thomas Wolf (Cofounder of Hugging Face), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 2 more.

ultravox by fixie-ai

0.4%
4k
Multimodal LLM for real-time voice interactions
created 1 year ago
updated 4 days ago
Starred by Jeff Hammerbacher (Cofounder of Cloudera), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 2 more.

MiniCPM-o by OpenBMB

0.2%
20k
MLLM for vision, speech, and multimodal live streaming on your phone
created 1 year ago
updated 1 month ago
Starred by Georgios Konstantopoulos (CTO, General Partner at Paradigm) and Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems).

GPT-SoVITS by RVC-Boss

0.6%
49k
Few-shot voice cloning and TTS web UI
created 1 year ago
updated 2 weeks ago