mini-omni2 by gpt-omni

Omni-interactive model for multimodal understanding and real-time voice conversations

created 9 months ago
1,781 stars

Top 24.7% on sourcepulse

Project Summary

Mini-Omni2 is an open-source, omni-interactive multimodal model designed to replicate GPT-4o's capabilities, including vision, speech, and duplex conversations. It targets researchers and developers seeking to build advanced conversational AI agents with real-time voice interaction and multimodal understanding.

How It Works

The model processes concatenated image, audio, and text features as a single input sequence, enabling combined multimodal tasks. Training proceeds in three stages: encoder adaptation, modal alignment, and multimodal fine-tuning. For output, it uses text-guided delayed parallel generation to produce real-time speech responses, building on Qwen2 as the LLM backbone, Whisper for audio encoding, and CLIP for image encoding.
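
The sketch below illustrates these two ideas conceptually: concatenating per-modality features into one sequence, and delayed parallel decoding in which audio tokens trail the text stream. It is a simplified sketch, not the project's actual API; names such as clip_encode, whisper_encode, embed, and llm_step are hypothetical placeholders.

```python
# Conceptual sketch only -- not mini-omni2's actual code. clip_encode, whisper_encode,
# embed, and llm_step are hypothetical callables standing in for the real encoders/LLM.
import torch

def build_omni_input(image, audio, text_ids, clip_encode, whisper_encode, embed):
    """Concatenate image, audio, and text features into one input sequence."""
    img_feats = clip_encode(image)      # (T_img, d) visual features, CLIP-style encoder
    aud_feats = whisper_encode(audio)   # (T_aud, d) audio features, Whisper-style encoder
    txt_feats = embed(text_ids)         # (T_txt, d) embeddings of the text prompt
    return torch.cat([img_feats, aud_feats, txt_feats], dim=0)

def delayed_parallel_decode(llm_step, prompt_feats, num_steps, delay=1):
    """Text tokens lead; audio tokens are emitted a few steps behind so the
    already-generated text can guide the speech stream in real time."""
    text_tokens, audio_tokens = [], []
    state = prompt_feats
    for step in range(num_steps):
        txt_tok, aud_toks, state = llm_step(state)  # one step over text + audio heads
        text_tokens.append(txt_tok)
        if step >= delay:                           # audio heads trail the text head
            audio_tokens.append(aud_toks)
    return text_tokens, audio_tokens
```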

Quick Start & Requirements

  • Installation: Requires Python 3.10 and ffmpeg. Install via pip install -r requirements.txt after cloning the repository.
  • Demo: A server (server.py) must be started first, followed by the Streamlit demo (webui/omni_streamlit.py); PyAudio is required for local Streamlit execution (see the launch sketch after this list).
  • Resources: No specific hardware requirements (GPU, CUDA) are mentioned, but typical LLM inference will benefit from GPU acceleration.
  • Links: Hugging Face, GitHub, Technical report
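
A minimal launch sketch under the assumptions above: the repository is cloned and dependencies are installed, and the server is started before the Streamlit UI. Exact flags, ports, and the API URL are omitted here and may be required in practice; consult the project README for the documented invocation.

```python
# Hypothetical launcher sketch -- file names (server.py, webui/omni_streamlit.py) come
# from the project summary above; any required ports, flags, or API URL are not shown.
import subprocess
import time

server = subprocess.Popen(["python", "server.py"])  # start the model/API server first
time.sleep(15)                                      # rough wait for model weights to load
try:
    # Then launch the Streamlit demo (requires PyAudio when run locally).
    subprocess.run(["streamlit", "run", "webui/omni_streamlit.py"], check=True)
finally:
    server.terminate()
```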

Highlighted Details

  • End-to-end speech-to-speech conversation without separate ASR/TTS models.
  • Real-time voice output with an interruption mechanism.
  • Omni-capable multimodal understanding (image, audio, text).
  • Leverages established models like Qwen2, Whisper, and CLIP.

Maintenance & Community

The project was released in October 2024. Key dependencies include Qwen2, litGPT, Whisper, CLIP, snac, and CosyVoice.

Licensing & Compatibility

The repository does not explicitly state a license. The inclusion of components from other projects (e.g., Whisper, CLIP) suggests potential licensing considerations for commercial use or closed-source integration.

Limitations & Caveats

The model is currently trained only on English; although it can process non-English audio via Whisper, the output remains English. The README notes potential issues with running the Streamlit demo on remote servers, requiring local execution with PyAudio.

Health Check

  • Last commit: 6 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History: 52 stars in the last 90 days

Explore Similar Projects

Starred by Thomas Wolf (Cofounder of Hugging Face), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 2 more.

ultravox by fixie-ai

Multimodal LLM for real-time voice interactions

Top 0.4% · 4k stars
created 1 year ago · updated 4 days ago