LLaMA-Omni by ictnlp

Speech-language model for low-latency, high-quality speech interaction

created 10 months ago
2,965 stars

Top 16.5% on sourcepulse

View on GitHub
Project Summary

LLaMA-Omni provides a low-latency, end-to-end speech interaction model built on Llama-3.1-8B-Instruct, enabling simultaneous text and speech response generation from speech input. It targets researchers and developers seeking GPT-4o-level speech capabilities in an open-source framework.

How It Works

LLaMA-Omni couples a Whisper-large-v3 speech encoder and a speech adaptor (with encoder and adaptor code adapted from SLAM-LLM) to the Llama-3.1-8B-Instruct LLM. Spoken instructions are encoded and projected into the LLM, which generates the text response while a streaming speech decoder simultaneously produces discrete units that the unit-based HiFi-GAN vocoder converts to audio. Because text and speech are produced in parallel rather than through a cascaded pipeline, response latency stays low.

Quick Start & Requirements

  • Install: clone the repo, create a conda env (python=3.10), run pip install -e . in the repo root and again inside the fairseq checkout, then pip install flash-attn --no-build-isolation (consolidated in the sketch after this list).
  • Prerequisites: the Llama-3.1-8B-Omni model (Hugging Face), the Whisper-large-v3 speech encoder, and the unit-based HiFi-GAN vocoder.
  • Demo: launch the controller (python -m omni_speech.serve.controller), the Gradio web server (python -m omni_speech.serve.gradio_web_server), and a model worker (python -m omni_speech.serve.model_worker).
  • Docs: https://github.com/ictnlp/LLaMA-Omni
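
For orientation, the install and demo steps above roughly amount to the shell session below. This is a minimal sketch, assuming the layout described in these bullets: the conda env name, the fairseq checkout location, and the per-process arguments (host/port, model paths) are illustrative, so check the repository README for the exact invocations.

```bash
# Setup sketch for LLaMA-Omni; names and omitted flags are illustrative,
# not the authoritative instructions from the repo README.
git clone https://github.com/ictnlp/LLaMA-Omni
cd LLaMA-Omni
conda create -n llama-omni python=3.10 -y   # env name is an assumption
conda activate llama-omni
pip install -e .                            # install the package itself

# fairseq is installed the same way from its own checkout
git clone https://github.com/pytorch/fairseq
(cd fairseq && pip install -e .)

pip install flash-attn --no-build-isolation

# Demo: run each server process in its own terminal (or background it).
# Each one takes additional arguments (host/port, model paths) that are
# documented in the repo and omitted here.
python -m omni_speech.serve.controller
python -m omni_speech.serve.model_worker        # serves Llama-3.1-8B-Omni
python -m omni_speech.serve.gradio_web_server
```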

Highlighted Details

  • Built on Llama-3.1-8B-Instruct for high-quality text responses.
  • Achieves speech interaction latency as low as 226ms.
  • Supports simultaneous text and speech response generation.
  • Trained in under 3 days on 4 GPUs.

Maintenance & Community

  • LLaMA-Omni2 released with models from 0.5B to 32B parameters.
  • Accepted at ICLR 2025.
  • Contact: fangqingkai21b@ict.ac.cn for questions.
  • Citation details provided.

Licensing & Compatibility

  • Code: Apache-2.0 License.
  • Model: Intended for academic research only; NOT for commercial purposes. Commercial use requires contacting fengyang@ict.ac.cn for a license.

Limitations & Caveats

The model is strictly for academic research and cannot be used commercially without a separate license. Streaming audio playback in the Gradio demo is implemented without autoplay due to stability concerns.

Health Check

  • Last commit: 2 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 1

Star History

66 stars in the last 90 days

Explore Similar Projects

ultravox by fixie-ai
Multimodal LLM for real-time voice interactions
Top 0.4% on sourcepulse · 4k stars · created 1 year ago · updated 4 days ago
Starred by Thomas Wolf (Cofounder of Hugging Face), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 2 more.