Open-source multimodal LLM for real-time speech interaction
Top 14.7% on sourcepulse
Mini-Omni is an open-source multimodal large language model designed for real-time, end-to-end speech-to-speech conversational capabilities. It targets developers and researchers building interactive voice assistants and applications that require natural, continuous spoken interaction, enabling users to talk while the model "thinks" and generates responses.
How It Works
Mini-Omni integrates speech processing directly into the LLM pipeline, eliminating the need for separate Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) models. It leverages Whisper for audio encoding and a proprietary "tts-adapter" (not open-sourced) for audio decoding and streaming output. This end-to-end approach allows for simultaneous text and audio generation, facilitating a "talking while thinking" experience with low latency.
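The sketch below illustrates this parallel text-and-audio decoding idea in Python. It is conceptual only: names such as model.init_state, model.step, vocoder.decode, and talk_while_thinking are hypothetical placeholders for illustration, not Mini-Omni's actual API.

# Conceptual sketch of "talking while thinking": at each decoding step the
# model emits a text token and the matching audio codec tokens in parallel,
# and the audio is played back before the full text answer is finished.
# All names here (model.init_state, model.step, vocoder.decode, play) are
# hypothetical placeholders, not Mini-Omni's real interface.
def talk_while_thinking(model, vocoder, audio_features, play):
    state = model.init_state(audio_features)   # Whisper-encoded user speech
    text_tokens = []
    while not state.finished:
        # One forward pass produces the next text token plus the audio
        # codec tokens for the same step (parallel output heads).
        text_tok, audio_toks, state = model.step(state)
        text_tokens.append(text_tok)
        # Decode the codec tokens to a waveform chunk and play it right
        # away, so speech starts while the text response is still forming.
        play(vocoder.decode(audio_toks))
    return model.detokenize(text_tokens)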
Quick Start & Requirements
Create a conda environment (conda create -n omni python=3.10), activate it, and install dependencies with pip install -r requirements.txt. The Streamlit demo additionally requires ffmpeg (via sudo apt-get install ffmpeg) and PyAudio==0.2.14. Start the server (python3 server.py --ip '0.0.0.0' --port 60808) before running the Streamlit or Gradio demos, and set API_URL to the server address.
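Once the server is running, a small Python client can post recorded audio to it. The snippet below is a sketch under assumptions: the /chat endpoint path, raw-bytes payload, and streamed audio response are illustrative guesses, so check the Streamlit/Gradio demo code for the actual request format.

import os
import requests

# API_URL should point at the running server (see command above); the
# "/chat" path is an assumption made for this example.
API_URL = os.environ.get("API_URL", "http://0.0.0.0:60808/chat")

def send_audio(wav_path):
    # Post a WAV file and collect the (possibly streamed) response bytes.
    with open(wav_path, "rb") as f:
        resp = requests.post(API_URL, data=f.read(), stream=True)
    resp.raise_for_status()
    return b"".join(resp.iter_content(chunk_size=4096))

if __name__ == "__main__":
    reply = send_audio("question.wav")
    with open("reply_audio.bin", "wb") as out:
        out.write(reply)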
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The open-source release does not include the "tts-adapter," which is needed for the full "talking while thinking" functionality. The model is trained only on English; it can understand other languages through Whisper's encoder but responds solely in English. Gradio demo latency may be higher due to audio streaming limitations.