Freeze-Omni by VITA-MLLM

Speech-to-speech dialogue model with frozen LLM for low-latency interaction

created 9 months ago
334 stars

Top 83.3% on sourcepulse

View on GitHub
Project Summary

Freeze-Omni is a speech-to-speech dialogue model designed for low-latency, intelligent conversational AI. It targets applications requiring real-time voice interaction, such as virtual assistants and interactive customer service, by keeping the backbone Large Language Model (LLM) frozen: the LLM's reasoning ability is preserved while speech input and output capabilities are attached around it.

How It Works

Freeze-Omni employs three key strategies for efficient speech-to-speech dialogue. It uses a chunk-wise streaming speech encoder for rapid input processing, and an autoregressive (AR) speech decoder with a single codebook for low-latency, streaming audio output. To enable duplex communication, it incorporates chunk-level state prediction, allowing the model to detect user interruptions and manage conversational flow. A "Model as a Server" approach further optimizes inference by managing multiple models and their caches independently, enabling flexible scheduling and better resource utilization.
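The chunk-level loop can be pictured with a short sketch. This is illustrative only: the names below (dialogue_loop, State, the encoder/classifier/decoder objects) are hypothetical and do not mirror Freeze-Omni's actual API. It shows how per-chunk encoding and state prediction interleave with streaming decoding, which is what bounds latency by the chunk size rather than the utterance length:

```python
# Illustrative sketch of chunk-wise streaming with chunk-level state
# prediction. All class and method names are hypothetical placeholders.
from enum import Enum

class State(Enum):
    LISTEN = 0      # user still speaking; keep accumulating context
    RESPOND = 1     # user turn ended; begin streaming a spoken answer
    INTERRUPT = 2   # user barged in; cancel the in-flight answer

def dialogue_loop(audio_chunks, encoder, frozen_llm, speech_decoder, classifier):
    """Consume fixed-size audio chunks and yield streamed answer audio."""
    context = []
    for chunk in audio_chunks:
        # Chunk-wise streaming encoder: features are produced per chunk,
        # so the model never waits for the full utterance.
        features = encoder.encode(chunk)
        context.append(features)

        # Chunk-level state prediction decides what to do after this chunk.
        state = classifier.predict(context)
        if state is State.RESPOND:
            # The frozen LLM generates text tokens; the AR speech decoder
            # maps them to single-codebook audio tokens and yields audio
            # as it goes, rather than after generation finishes.
            for text_token in frozen_llm.generate(context):
                yield speech_decoder.step(text_token)
        elif state is State.INTERRUPT:
            speech_decoder.reset()  # drop the current answer and re-listen
```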

Quick Start & Requirements

  • Install: Clone the repository, create a conda environment (conda create -n freeze-omni python=3.10), activate it, and install dependencies (pip install -r requirements.txt).
  • Requirements: Download the Freeze-Omni checkpoints and the Qwen2-7B-Instruct LLM and place them in the repository root.
  • Inference: Run CUDA_VISIBLE_DEVICES=0 python3 bin/inference.py --model_path ./checkpoints --input_wav ./assets/question.wav --output_wav ./assets/answer.wav --llm_path ./Qwen2-7B-Instruct --top_p 0.8 --top_k 20 --temperature 0.8 (a scripted batch variant is sketched after this list).
  • Demo Server: Run via script: sh scripts/run_demo_server.sh (after configuring IP/port).
  • Links: Project Demo Page, arXiv Paper, Hugging Face.
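For batch experiments, the documented command can be wrapped in a small script. A minimal sketch, assuming the checkpoint layout above; the glob pattern, the answer helper, and the output naming are placeholders, and only the CLI flags shown in the list above are used:

```python
# Batch-run the repo's documented inference CLI over several wav files.
import os
import subprocess
from pathlib import Path

def answer(question_wav: Path, output_wav: Path) -> None:
    """Run one inference with the documented flags, pinned to GPU 0."""
    subprocess.run(
        [
            "python3", "bin/inference.py",
            "--model_path", "./checkpoints",
            "--llm_path", "./Qwen2-7B-Instruct",
            "--input_wav", str(question_wav),
            "--output_wav", str(output_wav),
            "--top_p", "0.8",
            "--top_k", "20",
            "--temperature", "0.8",
        ],
        check=True,
        env={**os.environ, "CUDA_VISIBLE_DEVICES": "0"},
    )

# Example usage: answer every question wav under ./assets (placeholder paths).
for wav in sorted(Path("./assets").glob("question*.wav")):
    answer(wav, wav.with_name(wav.stem.replace("question", "answer") + ".wav"))
```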

Highlighted Details

  • Achieves low-latency speech-to-speech dialogue.
  • Retains LLM intelligence by using a frozen backbone.
  • Supports real-time interactive demos.
  • Utilizes a "Model as a Server" strategy for efficient inference (sketched below).
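The "Model as a Server" idea can be pictured as workers that each own a model instance and its per-session caches, with a scheduler routing requests between them so sessions can be balanced independently. The names and the load-balancing rule below are hypothetical, not Freeze-Omni's implementation:

```python
# Hypothetical sketch of "Model as a Server": independent model workers,
# each owning its own per-session caches, behind a simple scheduler.
from collections import defaultdict
from queue import Queue
from threading import Thread

class ModelWorker:
    """One model instance with its own per-session state caches."""
    def __init__(self, model_id: str):
        self.model_id = model_id
        self.caches = defaultdict(dict)   # session_id -> cached state
        self.inbox: Queue = Queue()

    def serve(self) -> None:
        while True:
            session_id, chunk, reply_to = self.inbox.get()
            cache = self.caches[session_id]
            # ... run the model on `chunk` with `cache`, update `cache` ...
            reply_to.put((self.model_id, session_id, f"processed {chunk!r}"))

class Scheduler:
    """Routes each request to the least-loaded worker."""
    def __init__(self, workers):
        self.workers = workers
        for w in workers:
            Thread(target=w.serve, daemon=True).start()

    def submit(self, session_id, chunk, reply_to: Queue) -> None:
        worker = min(self.workers, key=lambda w: w.inbox.qsize())
        worker.inbox.put((session_id, chunk, reply_to))
```

Because each worker manages its own caches, models can be scaled, restarted, or scheduled independently of one another, which is the flexibility the summary above attributes to this design.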

Maintenance & Community

The project was launched in November 2024. Community channels are available via WeChat.

Licensing & Compatibility

The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The README notes that model outputs are stochastic and do not represent the developers' views. The model is trained on a large-scale corpus, and the developers disclaim responsibility for issues arising from its use.

Health Check

  • Last commit: 2 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 23 stars in the last 90 days

Explore Similar Projects

Starred by Thomas Wolf (Cofounder of Hugging Face), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 2 more.

ultravox by fixie-ai
Multimodal LLM for real-time voice interactions
0.4% · 4k stars · created 1 year ago · updated 4 days ago

Starred by Jeff Hammerbacher (Cofounder of Cloudera), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 2 more.

MiniCPM-o by OpenBMB
MLLM for vision, speech, and multimodal live streaming on your phone
0.2% · 20k stars · created 1 year ago · updated 1 month ago