ultravox by fixie-ai

Multimodal LLM for real-time voice interactions

created 1 year ago
4,120 stars

Top 12.1% on sourcepulse

Project Summary

Ultravox is a multimodal LLM designed for real-time voice interactions, eliminating the need for a separate Automatic Speech Recognition (ASR) component. It maps audio directly into the LLM's high-dimensional embedding space, which enables faster responses and, in future versions, native understanding of paralinguistic cues such as emotion and timing. The project targets developers building voice AI agents and researchers exploring direct audio-to-LLM integration.

How It Works

Ultravox extends existing open-weight LLMs (like Llama 3, Mistral, Gemma) by incorporating a multimodal projector. This projector converts raw audio directly into the LLM's embedding space, bypassing traditional ASR pipelines. This direct coupling is key to its low-latency performance and potential for richer audio understanding.
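
To make the projector idea concrete, here is a minimal sketch in plain Python. It is not Ultravox's actual implementation: the frame stacking, the two-layer design with a ReLU, all dimensions, and every function name are assumptions chosen for illustration. It also assumes the audio has already been turned into one feature vector per frame. The only point it demonstrates is that audio features can be mapped to vectors of the same width as the LLM's token embeddings and then placed in the same input sequence, with no transcript in between.

```python
import math
import random


def stack_frames(frames, k):
    """Concatenate every k consecutive frames, shortening the sequence by a factor of k."""
    usable = len(frames) - len(frames) % k  # drop a trailing partial group
    return [sum(frames[i:i + k], []) for i in range(0, usable, k)]


def linear(x, weights):
    """Multiply vector x (length n) by a weight matrix stored as n rows of m columns."""
    return [sum(xi * row[j] for xi, row in zip(x, weights))
            for j in range(len(weights[0]))]


def make_weights(n_in, n_out, rng):
    """Random weights standing in for trained projector parameters."""
    scale = 1 / math.sqrt(n_in)
    return [[rng.uniform(-scale, scale) for _ in range(n_out)] for _ in range(n_in)]


def project_audio(frames, w1, w2, k):
    """Hypothetical projector: stack frames, apply a hidden layer with ReLU, project to LLM width."""
    projected = []
    for vec in stack_frames(frames, k):
        hidden = [max(0.0, h) for h in linear(vec, w1)]  # ReLU
        projected.append(linear(hidden, w2))
    return projected


# Illustrative sizes only; real models use far larger dimensions.
AUDIO_DIM, STACK, HIDDEN_DIM, LLM_DIM = 4, 2, 8, 6

rng = random.Random(0)
frames = [[rng.gauss(0, 1) for _ in range(AUDIO_DIM)] for _ in range(10)]
w1 = make_weights(AUDIO_DIM * STACK, HIDDEN_DIM, rng)
w2 = make_weights(HIDDEN_DIM, LLM_DIM, rng)

audio_embeddings = project_audio(frames, w1, w2, STACK)

# Stand-in for the embedded text prompt; the audio vectors join the same sequence.
text_embeddings = [[0.0] * LLM_DIM for _ in range(3)]
llm_input = text_embeddings + audio_embeddings

print(len(audio_embeddings), len(audio_embeddings[0]))  # 5 6
```

In this sketch, 10 audio frames become 5 vectors of width 6, matching the assumed LLM embedding width, so the LLM would consume them alongside text embeddings without an intermediate transcription step.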

Quick Start & Requirements

  • Install dependencies by running: just install
  • Requires Python 3.11, with Poetry for environment management.
  • Setup involves installing Homebrew, which is used to install tooling on macOS and Linux.
  • Official demo available at ultravox.ai.
  • Weights available on Hugging Face.
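
Before running the install step, it can help to confirm the listed requirements are in place. The script below is a small sketch, not part of the project: the function name check_prerequisites is hypothetical, it assumes that "Python 3.11" means exactly minor version 3.11, and it assumes the just and poetry executables are expected on PATH under those names.

```python
import shutil
import sys

REQUIRED_PYTHON = (3, 11)          # assumption: exactly Python 3.11
REQUIRED_TOOLS = ("just", "poetry")  # assumption: executable names on PATH


def check_prerequisites(version_info=sys.version_info, which=shutil.which):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if tuple(version_info[:2]) != REQUIRED_PYTHON:
        problems.append(
            f"Python 3.11 expected, found {version_info[0]}.{version_info[1]}"
        )
    for tool in REQUIRED_TOOLS:
        if which(tool) is None:
            problems.append(f"'{tool}' not found on PATH")
    return problems


if __name__ == "__main__":
    issues = check_prerequisites()
    for issue in issues:
        print("missing:", issue)
    if not issues:
        print("Ready to run: just install")
```

The version and lookup function are passed as parameters so the check can be exercised without changing the local environment.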

Highlighted Details

  • Supports Llama 3, Mistral, and Gemma backbones, with 70B and 8B variants.
  • Training the adapter/projector is efficient: 2-3 hours on 8x H100 GPUs for 14K steps.
  • Can be trained on custom audio data for new languages or improved performance.
  • Offers managed APIs for real-time voice AI agent development.
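
The training figures above can be turned into rough budget numbers. The snippet below is simple arithmetic on the stated values, assuming "14K steps" means 14,000 steps and that all 8 GPUs are busy for the full wall-clock time; the function name training_cost is illustrative.

```python
GPUS = 8
STEPS = 14_000  # assumption: "14K steps" read as 14,000
HOURS_LOW, HOURS_HIGH = 2, 3


def training_cost(hours, gpus=GPUS, steps=STEPS):
    """Convert wall-clock hours into GPU-hours and average seconds per step."""
    return {
        "gpu_hours": hours * gpus,
        "seconds_per_step": hours * 3600 / steps,
    }


for hours in (HOURS_LOW, HOURS_HIGH):
    cost = training_cost(hours)
    print(f"{hours} h: {cost['gpu_hours']} GPU-hours, "
          f"{cost['seconds_per_step']:.2f} s/step")
# 2 h: 16 GPU-hours, 0.51 s/step
# 3 h: 24 GPU-hours, 0.77 s/step
```

Under these assumptions, a projector training run costs roughly 16 to 24 H100 GPU-hours, or about half to three quarters of a second per step.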

Maintenance & Community

  • Active development with releases in late 2024.
  • Community support via Discord.
  • Hiring for full-time roles.

Licensing & Compatibility

  • The README does not explicitly state a license, so suitability for commercial use or closed-source linking requires clarification.

Limitations & Caveats

The project is actively evolving. The model currently outputs streaming text; future versions aim to emit speech tokens for direct audio synthesis. The README does not specify a license, which matters for commercial adoption.

Health Check

  • Last commit: 4 days ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 2
  • Issues (30d): 6

Star History

  • 244 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems) and Jeff Hammerbacher (cofounder of Cloudera).

AudioGPT by AIGC-Audio

0.1%
10k
Audio processing and generation research project
created 2 years ago
updated 1 year ago