mini-omni2 by gpt-omni

Omni-interactive model for multimodal understanding and real-time voice conversations

created 9 months ago
1,781 stars

Top 24.7% on sourcepulse

Project Summary

Mini-Omni2 is an open-source, omni-interactive multimodal model designed to replicate GPT-4o's capabilities, including vision, speech, and duplex conversations. It targets researchers and developers seeking to build advanced conversational AI agents with real-time voice interaction and multimodal understanding.

How It Works

The model processes concatenated image, audio, and text features as a single input sequence, enabling combined multimodal tasks. Training proceeds in three stages: encoder adaptation, modal alignment, and multimodal fine-tuning. For output, it uses text-guided delayed parallel generation to produce real-time speech responses, building on Qwen2 as the LLM backbone, Whisper for audio encoding, and CLIP for image encoding.
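
The sketch below illustrates these two ideas conceptually: concatenating per-modality features into one sequence, and delayed parallel decoding in which audio tokens trail the text stream. It is a simplified sketch, not the project's actual API; names such as clip_encode, whisper_encode, embed, and llm_step are hypothetical placeholders.

```python
# Conceptual sketch only -- not mini-omni2's actual code. clip_encode, whisper_encode,
# embed, and llm_step are hypothetical callables standing in for the real encoders/LLM.
import torch

def build_omni_input(image, audio, text_ids, clip_encode, whisper_encode, embed):
    """Concatenate image, audio, and text features into one input sequence."""
    img_feats = clip_encode(image)      # (T_img, d) visual features, CLIP-style encoder
    aud_feats = whisper_encode(audio)   # (T_aud, d) audio features, Whisper-style encoder
    txt_feats = embed(text_ids)         # (T_txt, d) embeddings of the text prompt
    return torch.cat([img_feats, aud_feats, txt_feats], dim=0)

def delayed_parallel_decode(llm_step, prompt_feats, num_steps, delay=1):
    """Text tokens lead; audio tokens are emitted a few steps behind so the
    already-generated text can guide the speech stream in real time."""
    text_tokens, audio_tokens = [], []
    state = prompt_feats
    for step in range(num_steps):
        txt_tok, aud_toks, state = llm_step(state)  # one step over text + audio heads
        text_tokens.append(txt_tok)
        if step >= delay:                           # audio heads trail the text head
            audio_tokens.append(aud_toks)
    return text_tokens, audio_tokens
```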

Quick Start & Requirements

  • Installation: Requires Python 3.10 and ffmpeg. Install via pip install -r requirements.txt after cloning the repository.
  • Demo: A server (server.py) must be started first, followed by the Streamlit demo (webui/omni_streamlit.py); PyAudio is required for local Streamlit execution (see the launch sketch after this list).
  • Resources: No specific hardware requirements (GPU, CUDA) are mentioned, but typical LLM inference will benefit from GPU acceleration.
  • Links: Hugging Face, GitHub, Technical report
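
A minimal launch sketch under the assumptions above: the repository is cloned and dependencies are installed, and the server is started before the Streamlit UI. Exact flags, ports, and the API URL are omitted here and may be required in practice; consult the project README for the documented invocation.

```python
# Hypothetical launcher sketch -- file names (server.py, webui/omni_streamlit.py) come
# from the project summary above; any required ports, flags, or API URL are not shown.
import subprocess
import time

server = subprocess.Popen(["python", "server.py"])  # start the model/API server first
time.sleep(15)                                      # rough wait for model weights to load
try:
    # Then launch the Streamlit demo (requires PyAudio when run locally).
    subprocess.run(["streamlit", "run", "webui/omni_streamlit.py"], check=True)
finally:
    server.terminate()
```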

Highlighted Details

  • End-to-end speech-to-speech conversation without separate ASR/TTS models.
  • Real-time voice output with an interruption mechanism.
  • Omni-capable multimodal understanding (image, audio, text).
  • Leverages established models like Qwen2, Whisper, and CLIP.

Maintenance & Community

The project was released in October 2024. Key dependencies include Qwen2, litGPT, Whisper, CLIP, snac, and CosyVoice.

Licensing & Compatibility

The repository does not explicitly state a license. The inclusion of components from other projects (e.g., Whisper, CLIP) suggests potential licensing considerations for commercial use or closed-source integration.

Limitations & Caveats

The model is currently trained only on English; although it can process non-English audio via Whisper, the output remains English. The README notes potential issues with running the Streamlit demo on remote servers, requiring local execution with PyAudio.

Health Check

  • Last commit: 6 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History: 52 stars in the last 90 days

Explore Similar Projects

Starred by Thomas Wolf (Cofounder of Hugging Face), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 2 more.

ultravox by fixie-ai

Multimodal LLM for real-time voice interactions

Top 0.4% · 4k stars
created 1 year ago · updated 4 days ago