Speech-to-speech dialogue model with frozen LLM for low-latency interaction
Freeze-Omni is a speech-to-speech dialogue model designed for low-latency, intelligent conversational AI. It targets applications that require real-time voice interaction, such as virtual assistants and interactive customer service, by keeping the Large Language Model (LLM) frozen to preserve its intelligence while building speech understanding and generation around it.
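The "frozen" part is the central design choice: the LLM's weights are never updated, and only the speech-side modules are trained. Below is a minimal PyTorch sketch of that idea, with illustrative stand-in modules (a tiny transformer for the LLM, a linear layer for the speech adapter) rather than the project's actual code:

```python
import torch
from torch import nn

# Hypothetical stand-ins: a tiny transformer plays the role of the LLM, and a
# linear layer plays the role of the speech adapter that maps acoustic
# features into the LLM's embedding space.
llm = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True),
    num_layers=2,
)
speech_adapter = nn.Linear(80, 512)

for p in llm.parameters():
    p.requires_grad_(False)  # freeze: no gradient ever reaches the LLM

# Only the speech-side parameters are optimized.
optimizer = torch.optim.AdamW(speech_adapter.parameters(), lr=1e-4)

feats = torch.randn(4, 100, 80)                  # fake fbank batch (B, T, D)
loss = llm(speech_adapter(feats)).pow(2).mean()  # dummy training objective
loss.backward()                                  # gradients stop at the adapter
optimizer.step()
```

Because the LLM receives no gradients, its text abilities cannot degrade during speech training; this is what lets Freeze-Omni add speech capabilities without sacrificing the backbone's intelligence.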
How It Works
Freeze-Omni combines three key strategies for efficient speech-to-speech dialogue. A chunk-wise streaming speech encoder processes input as it arrives, and an autoregressive (AR) speech decoder with a single codebook produces streaming audio output at low latency. For duplex communication, chunk-level state prediction lets the model detect user interruptions and manage conversational flow. A "Model as a Server" approach further optimizes inference by managing multiple models and their caches independently, enabling flexible scheduling and better resource utilization. A toy sketch of this loop follows.
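The sketch below illustrates the control flow just described, using hypothetical stand-in components (`encode_chunk`, `predict_state`, `llm_step`, and `decode_speech_tokens` are illustrative, not the repository's API): each audio chunk is encoded, then a chunk-level state decides whether to keep listening, abandon the current response on an interruption, or stream out speech tokens from the AR decoder.

```python
import random
from enum import Enum
from typing import Iterator, List

class State(Enum):
    LISTEN = 0     # user still speaking; keep consuming input
    RESPOND = 1    # model holds the floor; stream speech tokens out
    INTERRUPT = 2  # user barged in; abandon the current response

def encode_chunk(chunk: List[float]) -> List[float]:
    # Stand-in for the chunk-wise streaming speech encoder.
    return [sum(chunk) / max(len(chunk), 1)]

def predict_state(feats: List[float], cache: List[float]) -> State:
    # Stand-in for chunk-level state prediction; random here only so the
    # example exercises every branch.
    return random.choice(list(State))

def llm_step(feats: List[float], cache: List[float]) -> List[float]:
    # Stand-in for one forward step of the frozen LLM: the cache grows,
    # but the weights never change.
    cache.extend(feats)
    return cache[-4:]

def decode_speech_tokens(hidden: List[float]) -> List[int]:
    # Stand-in for the AR speech decoder; a single codebook means one
    # discrete token stream, decodable chunk by chunk.
    return [int(abs(h) * 100) % 1024 for h in hidden]

def stream_dialogue(chunks: Iterator[List[float]]) -> Iterator[List[int]]:
    cache: List[float] = []
    for chunk in chunks:
        feats = encode_chunk(chunk)
        state = predict_state(feats, cache)
        if state is State.INTERRUPT:
            cache.clear()  # drop stale context; yield the floor to the user
            continue
        if state is State.RESPOND:
            yield decode_speech_tokens(llm_step(feats, cache))

if __name__ == "__main__":
    fake_audio = ([0.01 * i] * 256 for i in range(8))  # fake 16 ms chunks @ 16 kHz
    for tokens in stream_dialogue(fake_audio):
        print(tokens)
```

The key property this structure captures is that output can begin as soon as the state flips to responding, without waiting for the full utterance, which is what keeps interaction latency low.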
Quick Start & Requirements
Create a conda environment (`conda create -n freeze-omni python=3.10`), activate it, and install dependencies (`pip install -r requirements.txt`). Then run offline inference on a sample WAV:

```
CUDA_VISIBLE_DEVICES=0 python3 bin/inference.py --model_path ./checkpoints --input_wav ./assets/question.wav --output_wav ./assets/answer.wav --llm_path ./Qwen2-7B-Instruct --top_p 0.8 --top_k 20 --temperature 0.8
```
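To process several prompts, the documented command can be wrapped in a small script. This is a hypothetical convenience wrapper around the CLI above, not part of the repository; all paths mirror the quick-start example:

```python
import os
import pathlib
import subprocess

# Run the documented inference command once per WAV in ./assets; output
# filenames are derived from the input names.
env = {**os.environ, "CUDA_VISIBLE_DEVICES": "0"}
for wav in sorted(pathlib.Path("./assets").glob("*.wav")):
    subprocess.run(
        [
            "python3", "bin/inference.py",
            "--model_path", "./checkpoints",
            "--input_wav", str(wav),
            "--output_wav", str(wav.with_name(wav.stem + "_answer.wav")),
            "--llm_path", "./Qwen2-7B-Instruct",
            "--top_p", "0.8", "--top_k", "20", "--temperature", "0.8",
        ],
        check=True,
        env=env,
    )
```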
To try live interaction, launch the demo server with `sh scripts/run_demo_server.sh` (after configuring IP/port).

Highlighted Details

- Frozen backbone LLM (Qwen2-7B-Instruct in the inference example), preserving its text intelligence while speech capabilities are added.
- Chunk-wise streaming speech encoder and single-codebook AR speech decoder for low-latency streaming input and output.
- Duplex dialogue via chunk-level state prediction for interruption handling.
Maintenance & Community
The project was launched in November 2024. Community channels are available via WeChat.
Licensing & Compatibility
The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
The README notes that model outputs involve randomness and do not represent the developers' views; since the model is trained on a large-scale corpus, the developers disclaim responsibility for any issues arising from its use.