mini-omni by gpt-omni

Open-source multimodal LLM for real-time speech interaction

created 11 months ago
3,375 stars

Top 14.7% on sourcepulse

Project Summary

Mini-Omni is an open-source multimodal large language model designed for real-time, end-to-end speech-to-speech conversation. It targets developers and researchers building interactive voice assistants and applications that need natural, continuous spoken interaction: the model can start talking while it is still "thinking," i.e., it streams audio alongside the text it is generating.

How It Works

Mini-Omni integrates speech processing directly into the LLM pipeline, eliminating the need for separate Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) models. It uses Whisper for audio encoding and SNAC for audio decoding, while a proprietary "tts-adapter" (not open-sourced) provides the streaming audio output of the full pipeline. This end-to-end approach allows simultaneous text and audio generation, enabling a low-latency "talking while thinking" experience.
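The "talking while thinking" loop is easiest to see as code. The sketch below is a minimal, self-contained illustration of the idea, assuming a model that decodes a text token and an audio token together at each step and flushes audio in small chunks for low latency. Every name here (encode_audio, step, decode_audio_tokens, talk_while_thinking) is a hypothetical placeholder, not Mini-Omni's actual API.

```python
from typing import Iterator, Tuple

def encode_audio(wav: bytes) -> list[int]:
    """Stand-in for Whisper's audio encoder: waveform -> input tokens (toy)."""
    return list(wav[:16])

def step(context: list[int]) -> Tuple[str, int]:
    """Stand-in for one LLM decoding step that yields a text token and an
    audio token in parallel: the core end-to-end idea."""
    t = len(context)
    return f"tok{t} ", t % 4096

def decode_audio_tokens(tokens: list[int]) -> bytes:
    """Stand-in for the SNAC decoder: audio tokens -> a waveform chunk (toy)."""
    return bytes(t % 256 for t in tokens)

def talk_while_thinking(wav: bytes, max_steps: int = 8) -> Iterator[Tuple[str, bytes]]:
    """Stream (partial_text, audio_chunk) pairs as they are generated,
    instead of waiting for the full text reply before running TTS."""
    context = encode_audio(wav)
    audio_buffer: list[int] = []
    for _ in range(max_steps):
        text_tok, audio_tok = step(context)  # text and audio decoded together
        context.append(audio_tok)
        audio_buffer.append(audio_tok)
        if len(audio_buffer) == 4:  # flush small chunks to keep latency low
            yield text_tok, decode_audio_tokens(audio_buffer)
            audio_buffer.clear()
        else:
            yield text_tok, b""

if __name__ == "__main__":
    for text, chunk in talk_while_thinking(b"fake-waveform-bytes"):
        print(text, len(chunk), "audio bytes")
```

In the real system the text stream and the audio stream come from the same forward pass, which is why no separate ASR or TTS stage sits in the latency path.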

Quick Start & Requirements

  • Install: Clone the repository, create a conda environment (conda create -n omni python=3.10), activate it, and run pip install -r requirements.txt.
  • Prerequisites: ffmpeg (via sudo apt-get install ffmpeg) and PyAudio==0.2.14 for the Streamlit demo.
  • Setup: Start the server (python3 server.py --ip '0.0.0.0' --port 60808) before running the Streamlit or Gradio demos, and set the API_URL environment variable to the server address; see the consolidated commands after this list.
  • Docs: Hugging Face, GitHub, Technical report
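For convenience, the steps above can be collected into one shell session. The repository URL, the demo script path, and the /chat endpoint are assumptions based on this summary; check the project's README for the canonical commands.

```bash
# Consolidated from the Quick Start steps above.
git clone https://github.com/gpt-omni/mini-omni.git
cd mini-omni
conda create -n omni python=3.10 -y
conda activate omni
pip install -r requirements.txt
sudo apt-get install ffmpeg        # system prerequisite
pip install PyAudio==0.2.14        # required for the Streamlit demo

# In one terminal: start the API server.
python3 server.py --ip '0.0.0.0' --port 60808

# In another terminal: launch a demo pointed at it.
# Script path and /chat endpoint are assumed; adjust to match the README.
API_URL=http://0.0.0.0:60808/chat streamlit run webui/omni_streamlit.py
```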

Highlighted Details

  • Real-time speech-to-speech conversation without external ASR/TTS.
  • "Talking while thinking" with simultaneous text and audio generation.
  • Streaming audio output.
  • Batch inference for "Audio-to-Text" and "Audio-to-Audio".

Maintenance & Community

  • Built on Qwen2 as the LLM backbone, with litGPT for training, Whisper for audio encoding, SNAC for audio decoding, CosyVoice for synthetic speech generation, and OpenOrca/MOSS for alignment.

Licensing & Compatibility

  • The repository does not explicitly state a license. Use of components such as Qwen2, litGPT, and Whisper implies adherence to their respective licenses.

Limitations & Caveats

The open-source release does not include the "tts-adapter," which is crucial for the full "talking while thinking" functionality. The model is trained exclusively on English; it can understand other languages via Whisper's multilingual encoding but responds only in English. The Gradio demo may exhibit higher latency due to audio-streaming limitations.

Health Check

  • Last commit: 9 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 89 stars in the last 90 days
