ultravox by fixie-ai

Multimodal LLM for real-time voice interactions

Created 1 year ago

4,310 stars

Top 11.3% on SourcePulse

View on GitHub

7 Experts Love This Project

Thomas Wolf

Cofounder of Hugging Face

Chip Huyen

Author of "AI Engineering", "Designing Machine Learning Systems"

Luis Capelo

Cofounder of Lightning AI

Jeff Hammerbacher

Cofounder of Cloudera

and 3 more!

Project Summary

Ultravox is a multimodal LLM designed for real-time voice interactions, eliminating the need for separate Automatic Speech Recognition (ASR) components. It directly processes audio into an LLM's high-dimensional space, enabling faster responses and future native understanding of paralinguistic cues like emotion and timing. The project targets developers building voice AI agents and researchers exploring direct audio-to-LLM integration.

How It Works

Ultravox extends existing open-weight LLMs (like Llama 3, Mistral, Gemma) by incorporating a multimodal projector. This projector converts raw audio directly into the LLM's embedding space, bypassing traditional ASR pipelines. This direct coupling is key to its low-latency performance and potential for richer audio understanding.

Quick Start & Requirements

Install dependencies using just install.
Requires Python 3.11 and Poetry for environment management.
Setup involves installing Homebrew for macOS/Linux tools.
Official demo available at ultravox.ai.
Weights available on Hugging Face.

Highlighted Details

Supports Llama 3, Mistral, and Gemma backbones, with 70B and 8B variants.
Training adapter/projector is efficient: 2-3 hours on 8x H100 GPUs for 14K steps.
Can be trained on custom audio data for new languages or improved performance.
Offers managed APIs for real-time voice AI agent development.

Maintenance & Community

Active development with releases in late 2024.
Community support via Discord.
Hiring for full-time roles.

Licensing & Compatibility

The specific license is not explicitly stated in the README. Compatibility for commercial use or closed-source linking requires clarification.

Limitations & Caveats

The project is actively evolving, with current output being streaming text. Future versions aim to emit speech tokens for direct audio synthesis. The README does not specify the license, which is crucial for commercial adoption.

Health Check

Last Commit

4 weeks ago

Responsiveness

1 week

Pull Requests (30d)

Issues (30d)

Star History

35 stars in the last 30 days