LLaMA-Omni by ictnlp

Speech-language model for low-latency, high-quality speech interaction

Created 1 year ago
3,070 stars

Top 15.6% on SourcePulse

Project Summary

LLaMA-Omni provides a low-latency, end-to-end speech interaction model built on Llama-3.1-8B-Instruct, enabling simultaneous text and speech response generation from speech input. It targets researchers and developers seeking GPT-4o-level speech capabilities in an open-source framework.

How It Works

LLaMA-Omni integrates a speech encoder (borrowing from SLAM-LLM) and a speech-to-speech (s2s) adaptor with the Llama-3.1-8B-Instruct LLM. This architecture takes spoken instructions as input and produces the text response and the synthesized speech at the same time, which is how it reaches a response latency as low as 226 ms.

Quick Start & Requirements

  • Install: Clone the repo, create a conda environment (python=3.10), run pip install -e . in the repo root, run pip install -e . again inside the fairseq repo, then run pip install flash-attn --no-build-isolation.
  • Prerequisites: Llama-3.1-8B-Omni model (Hugging Face), Whisper-large-v3 model, unit-based HiFi-GAN vocoder.
  • Demo: Launch controller (python -m omni_speech.serve.controller), Gradio server (python -m omni_speech.serve.gradio_web_server), and model worker (python -m omni_speech.serve.model_worker).
  • Docs: https://github.com/ictnlp/LLaMA-Omni
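
The demo step above starts three separate processes. As a minimal sketch, the Python script below builds those three commands from the module names listed in the Demo bullet and, by default, only prints them. The helper names (build_commands, launch), the delay between launches, and the dry-run switch are illustrative assumptions, not part of the project. The summary shows no command-line flags, so none are passed here; check the repository docs for the host, port, and model-path options the servers actually expect.

```python
import shlex
import subprocess
import sys
import time

# Module names are copied from the Demo bullet above. The summary lists no
# command-line flags, so this sketch passes none.
DEMO_MODULES = [
    "omni_speech.serve.controller",
    "omni_speech.serve.gradio_web_server",
    "omni_speech.serve.model_worker",
]


def build_commands(python=sys.executable):
    """Return one `python -m <module>` command per demo component."""
    return [[python, "-m", module] for module in DEMO_MODULES]


def launch(commands, delay_s=5.0, dry_run=True):
    """Print each command; when dry_run is False, also start it.

    delay_s is an assumed pause so each process can come up before the
    next one starts. It is not a value taken from the project.
    """
    procs = []
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            procs.append(subprocess.Popen(cmd))
            time.sleep(delay_s)
    return procs


if __name__ == "__main__":
    # Dry run by default: prints the three commands without starting anything.
    launch(build_commands())
```

Calling launch(build_commands(), dry_run=False) would start the processes in the listed order and return their Popen handles, so the caller can terminate them later.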

Highlighted Details

  • Built on Llama-3.1-8B-Instruct for high-quality text responses.
  • Achieves speech interaction latency as low as 226 ms.
  • Supports simultaneous text and speech response generation.
  • Trained in under 3 days on 4 GPUs.

Maintenance & Community

  • LLaMA-Omni2 released with models from 0.5B to 32B parameters.
  • Accepted at ICLR 2025.
  • Contact: fangqingkai21b@ict.ac.cn for questions.
  • Citation details are provided in the repository.

Licensing & Compatibility

  • Code: Apache-2.0 License.
  • Model: Intended for academic research only; NOT for commercial purposes. Commercial use requires contacting fengyang@ict.ac.cn for a license.

Limitations & Caveats

The model is strictly for academic research and cannot be used commercially without a separate license. Streaming audio playback in the Gradio demo is implemented without autoplay due to stability concerns.

Health Check

  • Last commit: 4 months ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 1
  • Star history: 96 stars in the last 30 days
