mini-omni by gpt-omni

Open-source multimodal LLM for real-time speech interaction

created 11 months ago
3,375 stars

Top 14.7% on sourcepulse

Project Summary

Mini-Omni is an open-source multimodal large language model designed for real-time, end-to-end speech-to-speech conversation. It targets developers and researchers building interactive voice assistants and applications that need natural, continuous spoken interaction: the model can start talking while it is still "thinking," i.e., it streams audio alongside the text it is generating.

How It Works

Mini-Omni integrates speech processing directly into the LLM pipeline, eliminating the need for separate Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) models. It uses Whisper for audio encoding and SNAC for audio decoding, while a proprietary "tts-adapter" (not open-sourced) provides the streaming audio output of the full pipeline. This end-to-end approach allows simultaneous text and audio generation, enabling a low-latency "talking while thinking" experience.
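The "talking while thinking" loop is easiest to see as code. The sketch below is a minimal, self-contained illustration of the idea, assuming a model that decodes a text token and an audio token together at each step and flushes audio in small chunks for low latency. Every name here (encode_audio, step, decode_audio_tokens, talk_while_thinking) is a hypothetical placeholder, not Mini-Omni's actual API.

```python
from typing import Iterator, Tuple

def encode_audio(wav: bytes) -> list[int]:
    """Stand-in for Whisper's audio encoder: waveform -> input tokens (toy)."""
    return list(wav[:16])

def step(context: list[int]) -> Tuple[str, int]:
    """Stand-in for one LLM decoding step that yields a text token and an
    audio token in parallel: the core end-to-end idea."""
    t = len(context)
    return f"tok{t} ", t % 4096

def decode_audio_tokens(tokens: list[int]) -> bytes:
    """Stand-in for the SNAC decoder: audio tokens -> a waveform chunk (toy)."""
    return bytes(t % 256 for t in tokens)

def talk_while_thinking(wav: bytes, max_steps: int = 8) -> Iterator[Tuple[str, bytes]]:
    """Stream (partial_text, audio_chunk) pairs as they are generated,
    instead of waiting for the full text reply before running TTS."""
    context = encode_audio(wav)
    audio_buffer: list[int] = []
    for _ in range(max_steps):
        text_tok, audio_tok = step(context)  # text and audio decoded together
        context.append(audio_tok)
        audio_buffer.append(audio_tok)
        if len(audio_buffer) == 4:  # flush small chunks to keep latency low
            yield text_tok, decode_audio_tokens(audio_buffer)
            audio_buffer.clear()
        else:
            yield text_tok, b""

if __name__ == "__main__":
    for text, chunk in talk_while_thinking(b"fake-waveform-bytes"):
        print(text, len(chunk), "audio bytes")
```

In the real system the text stream and the audio stream come from the same forward pass, which is why no separate ASR or TTS stage sits in the latency path.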

Quick Start & Requirements

  • Install: Clone the repository, create a conda environment (conda create -n omni python=3.10), activate it, and run pip install -r requirements.txt.
  • Prerequisites: ffmpeg (via sudo apt-get install ffmpeg) and PyAudio==0.2.14 for the Streamlit demo.
  • Setup: Start the server (python3 server.py --ip '0.0.0.0' --port 60808) before running the Streamlit or Gradio demos, and set the API_URL environment variable to the server address; see the consolidated commands after this list.
  • Docs: Hugging Face, GitHub, Technical report
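For convenience, the steps above can be collected into one shell session. The repository URL, the demo script path, and the /chat endpoint are assumptions based on this summary; check the project's README for the canonical commands.

```bash
# Consolidated from the Quick Start steps above.
git clone https://github.com/gpt-omni/mini-omni.git
cd mini-omni
conda create -n omni python=3.10 -y
conda activate omni
pip install -r requirements.txt
sudo apt-get install ffmpeg        # system prerequisite
pip install PyAudio==0.2.14        # required for the Streamlit demo

# In one terminal: start the API server.
python3 server.py --ip '0.0.0.0' --port 60808

# In another terminal: launch a demo pointed at it.
# Script path and /chat endpoint are assumed; adjust to match the README.
API_URL=http://0.0.0.0:60808/chat streamlit run webui/omni_streamlit.py
```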

Highlighted Details

  • Real-time speech-to-speech conversation without external ASR/TTS.
  • "Talking while thinking" with simultaneous text and audio generation.
  • Streaming audio output.
  • Batch inference for "Audio-to-Text" and "Audio-to-Audio".

Maintenance & Community

  • Built on Qwen2 as the LLM backbone, with litGPT for training, Whisper for audio encoding, SNAC for audio decoding, CosyVoice for synthetic speech generation, and OpenOrca/MOSS for alignment.

Licensing & Compatibility

  • The repository does not explicitly state a license. Use of components such as Qwen2, litGPT, and Whisper implies adherence to their respective licenses.

Limitations & Caveats

The open-source release does not include the "tts-adapter," which is crucial for the full "talking while thinking" functionality. The model is trained exclusively on English; it can understand other languages via Whisper's multilingual encoding but responds only in English. The Gradio demo may exhibit higher latency due to audio-streaming limitations.

Health Check

  • Last commit: 9 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 89 stars in the last 90 days
