Freeze-Omni by VITA-MLLM

Speech-to-speech dialogue model with frozen LLM for low-latency interaction

Created 10 months ago
342 stars

Top 80.8% on SourcePulse

View on GitHub
Project Summary

Freeze-Omni is a speech-to-speech dialogue model designed for low-latency, intelligent conversational AI. It targets applications that require real-time voice interaction, such as virtual assistants and interactive customer service, by keeping a Large Language Model (LLM) frozen to preserve its intelligence while adding speech input and output capabilities.

How It Works

Freeze-Omni employs three key strategies for efficient speech-to-speech dialogue. A chunk-wise streaming speech encoder processes input audio incrementally, and an autoregressive (AR) speech decoder with a single codebook produces streaming audio output with low latency. To enable duplex communication, the model performs chunk-level state prediction, anticipating user interruptions and managing conversational flow. A "Model as a Server" strategy further optimizes inference by managing multiple models and their caches independently, enabling flexible scheduling and resource utilization.
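The chunk-level state prediction idea can be illustrated with a rough Python sketch. This is illustrative only: the class names, chunk size, and state labels below are assumptions for exposition and do not reflect Freeze-Omni's actual code or API.

    # Hypothetical sketch: chunk-wise streaming encoding plus per-chunk state prediction.
    import torch
    import torch.nn as nn

    CHUNK_SAMPLES = 6400  # assumed: ~400 ms of 16 kHz audio per chunk

    class StreamingEncoder(nn.Module):
        """Stand-in for a chunk-wise streaming speech encoder."""
        def __init__(self, dim=256):
            super().__init__()
            self.proj = nn.Linear(CHUNK_SAMPLES, dim)

        def forward(self, chunk, cache=None):
            feat = self.proj(chunk)   # encode only the current chunk
            return feat, cache        # cache would carry encoder state across chunks

    class StateHead(nn.Module):
        """Predicts a dialogue state per chunk: 0 = keep listening, 1 = user finished, 2 = interruption."""
        def __init__(self, dim=256, num_states=3):
            super().__init__()
            self.cls = nn.Linear(dim, num_states)

        def forward(self, feat):
            return self.cls(feat).argmax(dim=-1)

    encoder, state_head = StreamingEncoder(), StateHead()

    def duplex_loop(audio_stream):
        cache = None
        for chunk in audio_stream:                 # chunks arrive in real time
            feat, cache = encoder(chunk, cache)
            state = state_head(feat).item()
            if state == 1:
                print("user turn ended -> hand speech features to the frozen LLM, start speaking")
            elif state == 2:
                print("interruption detected -> stop the current speech output")
            # state == 0: keep accumulating speech features

Because the state is predicted once per chunk rather than per sample, the model can decide quickly whether to keep listening, start responding, or yield to an interruption, which is what enables the duplex behavior described above.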

Quick Start & Requirements

  • Install: Clone the repository, create a conda environment (conda create -n freeze-omni python=3.10), activate it, and install dependencies (pip install -r requirements.txt).
  • Requirements: Requires downloading the Freeze-Omni checkpoints and the Qwen2-7B-Instruct LLM and placing them in the repository root directory.
  • Inference: Run via Python command (a scripted wrapper is sketched after this list): CUDA_VISIBLE_DEVICES=0 python3 bin/inference.py --model_path ./checkpoints --input_wav ./assets/question.wav --output_wav ./assets/answer.wav --llm_path ./Qwen2-7B-Instruct --top_p 0.8 --top_k 20 --temperature 0.8.
  • Demo Server: Run via script: sh scripts/run_demo_server.sh (after configuring IP/port).
  • Links: Project Demo Page, arXiv Paper, Hugging Face.
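For scripted or repeated runs, the documented inference command can be wrapped in a small Python helper. The flags and paths below mirror the README example; only the subprocess plumbing is new, and the defaults should be adjusted if your local layout differs.

    # Minimal wrapper around the documented offline inference command.
    import os
    import subprocess

    cmd = [
        "python3", "bin/inference.py",
        "--model_path", "./checkpoints",
        "--input_wav", "./assets/question.wav",
        "--output_wav", "./assets/answer.wav",
        "--llm_path", "./Qwen2-7B-Instruct",
        "--top_p", "0.8",
        "--top_k", "20",
        "--temperature", "0.8",
    ]

    # Pin inference to a single GPU, mirroring CUDA_VISIBLE_DEVICES=0 in the README example.
    env = {**os.environ, "CUDA_VISIBLE_DEVICES": "0"}
    subprocess.run(cmd, env=env, check=True)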

Highlighted Details

  • Achieves low-latency speech-to-speech dialogue.
  • Retains LLM intelligence by using a frozen backbone.
  • Supports real-time interactive demos.
  • Utilizes a "Model as a Server" strategy for efficient inference.

Maintenance & Community

The project was launched in November 2024. Community channels are available via WeChat.

Licensing & Compatibility

The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The README notes that model outputs involve sampling randomness and do not represent the developers' views. Because the model is trained on a large-scale corpus, the developers disclaim responsibility for any issues arising from its use.

Health Check

  • Last Commit: 3 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 9 stars in the last 30 days
