Omni-interactive model for multimodal understanding and real-time voice conversations
Top 24.7% on sourcepulse
Mini-Omni2 is an open-source, omni-interactive multimodal model designed to replicate GPT-4o's capabilities, including vision, speech, and duplex conversations. It targets researchers and developers seeking to build advanced conversational AI agents with real-time voice interaction and multimodal understanding.
How It Works
The model processes concatenated image, audio, and text features as a single input sequence for comprehensive multimodal tasks. Training follows a three-stage approach: encoder adaptation, modal alignment, and multimodal fine-tuning. For output, it uses text-guided delayed parallel generation to produce real-time speech responses. The architecture builds on Qwen2 as the LLM backbone, Whisper for audio encoding, and CLIP for image encoding.
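The following is a minimal, self-contained sketch (not the project's actual code) of how concatenated multimodal features and a delayed parallel output head can fit together; the module names, dimensions, and the GRU stand-in for the Qwen2 backbone are all illustrative assumptions.

```python
# Illustrative sketch only: stand-ins for CLIP/Whisper adapters, the Qwen2
# backbone, and SNAC audio codes. Shapes and vocab sizes are assumptions.
import torch
import torch.nn as nn

class TinyOmniSketch(nn.Module):
    def __init__(self, d=896, vocab_text=32000, vocab_audio=4096):
        super().__init__()
        self.proj_image = nn.Linear(512, d)   # stand-in for a CLIP feature adapter
        self.proj_audio = nn.Linear(384, d)   # stand-in for a Whisper feature adapter
        self.embed_text = nn.Embedding(vocab_text, d)
        self.backbone = nn.GRU(d, d, batch_first=True)  # stand-in for the Qwen2 LLM
        self.head_text = nn.Linear(d, vocab_text)
        self.head_audio = nn.Linear(d, vocab_audio)     # stand-in for SNAC code logits

    def forward(self, image_feats, audio_feats, text_ids):
        # Concatenate image, audio, and text features into one input sequence.
        seq = torch.cat([
            self.proj_image(image_feats),
            self.proj_audio(audio_feats),
            self.embed_text(text_ids),
        ], dim=1)
        hidden, _ = self.backbone(seq)
        # Rough analogue of text-guided delayed parallel generation: the audio
        # stream is predicted one step behind the text stream, so the generated
        # text can guide the speech tokens.
        text_logits = self.head_text(hidden)
        audio_logits = self.head_audio(hidden[:, :-1])
        return text_logits, audio_logits

model = TinyOmniSketch()
image_feats = torch.randn(1, 16, 512)   # assumed CLIP patch features
audio_feats = torch.randn(1, 50, 384)   # assumed Whisper encoder frames
text_ids = torch.randint(0, 32000, (1, 8))
text_logits, audio_logits = model(image_feats, audio_feats, text_ids)
print(text_logits.shape, audio_logits.shape)
```

This sketch only illustrates the text-leads-audio ordering, not the model's actual decoding schedule.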
Quick Start & Requirements
Requires ffmpeg. Install Python dependencies via pip install -r requirements.txt after cloning the repository. To run the demo, start the server (server.py) first, then launch the Streamlit UI (webui/omni_streamlit.py). PyAudio is required for local Streamlit execution.
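As a quick sanity check before launching the demo, a small script along these lines can verify the prerequisites; the file names come from the repository, while the check itself and any launch flags are assumptions rather than the project's documented workflow.

```python
# Prerequisite check sketch; server.py and webui/omni_streamlit.py are the
# files named in the README, everything else here is an assumption.
import shutil
import sys

def check_prerequisites():
    if shutil.which("ffmpeg") is None:
        sys.exit("ffmpeg not found on PATH; install it before running the demo.")
    try:
        import pyaudio  # noqa: F401  needed for the local Streamlit demo
    except ImportError:
        sys.exit("PyAudio is missing; install it to run the Streamlit demo locally.")
    print("Prerequisites look OK.")

if __name__ == "__main__":
    check_prerequisites()
    # Typical launch order (run in separate shells; see the repo README for
    # the exact flags, e.g. host/port options):
    #   python server.py
    #   streamlit run webui/omni_streamlit.py
```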
Highlighted Details
Maintenance & Community
The project was released in October 2024. Key dependencies include Qwen2, litGPT, Whisper, CLIP, snac, and CosyVoice.
Licensing & Compatibility
The repository does not explicitly state a license. The inclusion of components from other projects (e.g., Whisper, CLIP) suggests potential licensing considerations for commercial use or closed-source integration.
Limitations & Caveats
The model is currently trained only on English; although it can process non-English audio via Whisper, its responses remain in English. The README also notes that the Streamlit demo may not work in remote server configurations and should be run locally with PyAudio installed.