Freeze-Omni by VITA-MLLM

Speech-to-speech dialogue model with frozen LLM for low-latency interaction

created 9 months ago
334 stars

Top 83.3% on sourcepulse

View on GitHub
Project Summary

Freeze-Omni is a speech-to-speech dialogue model designed for low-latency, intelligent conversational AI. It targets applications requiring real-time voice interaction, such as virtual assistants and interactive customer service, by keeping the backbone Large Language Model (LLM) frozen: the LLM's reasoning ability is preserved while speech input and output capabilities are attached around it.

How It Works

Freeze-Omni employs three key strategies for efficient speech-to-speech dialogue. It uses a chunk-wise streaming speech encoder for rapid input processing, and an autoregressive (AR) speech decoder with a single codebook for low-latency, streaming audio output. To enable duplex communication, it incorporates chunk-level state prediction, allowing the model to detect user interruptions and manage conversational flow. A "Model as a Server" approach further optimizes inference by managing multiple models and their caches independently, enabling flexible scheduling and better resource utilization.
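The chunk-level loop can be pictured with a short sketch. This is illustrative only: the names below (dialogue_loop, State, the encoder/classifier/decoder objects) are hypothetical and do not mirror Freeze-Omni's actual API. It shows how per-chunk encoding and state prediction interleave with streaming decoding, which is what bounds latency by the chunk size rather than the utterance length:

```python
# Illustrative sketch of chunk-wise streaming with chunk-level state
# prediction. All class and method names are hypothetical placeholders.
from enum import Enum

class State(Enum):
    LISTEN = 0      # user still speaking; keep accumulating context
    RESPOND = 1     # user turn ended; begin streaming a spoken answer
    INTERRUPT = 2   # user barged in; cancel the in-flight answer

def dialogue_loop(audio_chunks, encoder, frozen_llm, speech_decoder, classifier):
    """Consume fixed-size audio chunks and yield streamed answer audio."""
    context = []
    for chunk in audio_chunks:
        # Chunk-wise streaming encoder: features are produced per chunk,
        # so the model never waits for the full utterance.
        features = encoder.encode(chunk)
        context.append(features)

        # Chunk-level state prediction decides what to do after this chunk.
        state = classifier.predict(context)
        if state is State.RESPOND:
            # The frozen LLM generates text tokens; the AR speech decoder
            # maps them to single-codebook audio tokens and yields audio
            # as it goes, rather than after generation finishes.
            for text_token in frozen_llm.generate(context):
                yield speech_decoder.step(text_token)
        elif state is State.INTERRUPT:
            speech_decoder.reset()  # drop the current answer and re-listen
```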

Quick Start & Requirements

  • Install: Clone the repository, create a conda environment (conda create -n freeze-omni python=3.10), activate it, and install dependencies (pip install -r requirements.txt).
  • Requirements: Download the Freeze-Omni checkpoints and the Qwen2-7B-Instruct LLM and place them in the repository root.
  • Inference: Run CUDA_VISIBLE_DEVICES=0 python3 bin/inference.py --model_path ./checkpoints --input_wav ./assets/question.wav --output_wav ./assets/answer.wav --llm_path ./Qwen2-7B-Instruct --top_p 0.8 --top_k 20 --temperature 0.8 (a scripted batch variant is sketched after this list).
  • Demo Server: Run via script: sh scripts/run_demo_server.sh (after configuring IP/port).
  • Links: Project Demo Page, arXiv Paper, Hugging Face.
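For batch experiments, the documented command can be wrapped in a small script. A minimal sketch, assuming the checkpoint layout above; the glob pattern, the answer helper, and the output naming are placeholders, and only the CLI flags shown in the list above are used:

```python
# Batch-run the repo's documented inference CLI over several wav files.
import os
import subprocess
from pathlib import Path

def answer(question_wav: Path, output_wav: Path) -> None:
    """Run one inference with the documented flags, pinned to GPU 0."""
    subprocess.run(
        [
            "python3", "bin/inference.py",
            "--model_path", "./checkpoints",
            "--llm_path", "./Qwen2-7B-Instruct",
            "--input_wav", str(question_wav),
            "--output_wav", str(output_wav),
            "--top_p", "0.8",
            "--top_k", "20",
            "--temperature", "0.8",
        ],
        check=True,
        env={**os.environ, "CUDA_VISIBLE_DEVICES": "0"},
    )

# Example usage: answer every question wav under ./assets (placeholder paths).
for wav in sorted(Path("./assets").glob("question*.wav")):
    answer(wav, wav.with_name(wav.stem.replace("question", "answer") + ".wav"))
```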

Highlighted Details

  • Achieves low-latency speech-to-speech dialogue.
  • Retains LLM intelligence by using a frozen backbone.
  • Supports real-time interactive demos.
  • Utilizes a "Model as a Server" strategy for efficient inference (sketched below).
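The "Model as a Server" idea can be pictured as workers that each own a model instance and its per-session caches, with a scheduler routing requests between them so sessions can be balanced independently. The names and the load-balancing rule below are hypothetical, not Freeze-Omni's implementation:

```python
# Hypothetical sketch of "Model as a Server": independent model workers,
# each owning its own per-session caches, behind a simple scheduler.
from collections import defaultdict
from queue import Queue
from threading import Thread

class ModelWorker:
    """One model instance with its own per-session state caches."""
    def __init__(self, model_id: str):
        self.model_id = model_id
        self.caches = defaultdict(dict)   # session_id -> cached state
        self.inbox: Queue = Queue()

    def serve(self) -> None:
        while True:
            session_id, chunk, reply_to = self.inbox.get()
            cache = self.caches[session_id]
            # ... run the model on `chunk` with `cache`, update `cache` ...
            reply_to.put((self.model_id, session_id, f"processed {chunk!r}"))

class Scheduler:
    """Routes each request to the least-loaded worker."""
    def __init__(self, workers):
        self.workers = workers
        for w in workers:
            Thread(target=w.serve, daemon=True).start()

    def submit(self, session_id, chunk, reply_to: Queue) -> None:
        worker = min(self.workers, key=lambda w: w.inbox.qsize())
        worker.inbox.put((session_id, chunk, reply_to))
```

Because each worker manages its own caches, models can be scaled, restarted, or scheduled independently of one another, which is the flexibility the summary above attributes to this design.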

Maintenance & Community

The project was launched in November 2024. Community channels are available via WeChat.

Licensing & Compatibility

The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The README notes that model outputs are stochastic and do not represent the developers' views. The model is trained on a large-scale corpus, and the developers disclaim responsibility for issues arising from its use.

Health Check

  • Last commit: 2 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 23 stars in the last 90 days

Explore Similar Projects

Starred by Thomas Wolf (Cofounder of Hugging Face), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 2 more.

ultravox by fixie-ai
Multimodal LLM for real-time voice interactions
0.4% · 4k stars · created 1 year ago · updated 4 days ago

Starred by Jeff Hammerbacher (Cofounder of Cloudera), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 2 more.

MiniCPM-o by OpenBMB
MLLM for vision, speech, and multimodal live streaming on your phone
0.2% · 20k stars · created 1 year ago · updated 1 month ago