Voila by maitrix-org

Voice-language foundation models for real-time human-AI interaction

Created 6 months ago
447 stars

Top 67.3% on SourcePulse

Project Summary

Voila is a family of large voice-language foundation models for real-time, natural human-AI interaction. It targets researchers and developers who want conversational AI that is not held back by high latency or the loss of vocal nuance. Voila processes audio end to end, enabling autonomous, expressive voice dialogue with sub-200 ms latency.

How It Works

Voila uses an end-to-end model design built on a hierarchical Transformer architecture. Voice and language modeling are integrated in a single model, which allows audio to be processed as a real-time stream and responses to be produced with low latency. The architecture is optimized for autonomous, persona-driven interaction, and one unified model handles several audio tasks, including ASR, TTS, and speech translation.
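
To make "streaming" concrete, here is a minimal Python sketch of frame-by-frame audio handling. It is not Voila's implementation: the 16 kHz sample rate, the 80 ms frame length, and the function name stream_frames are assumptions chosen for illustration. It shows only the general idea that a streaming pipeline hands fixed-size frames to a model as audio arrives, rather than waiting for the whole utterance.

```python
from typing import Iterable, Iterator, List

SAMPLE_RATE_HZ = 16_000   # assumed sample rate, not taken from Voila
FRAME_MS = 80             # assumed frame length, not taken from Voila
FRAME_SAMPLES = SAMPLE_RATE_HZ * FRAME_MS // 1000  # 1280 samples per frame


def stream_frames(chunks: Iterable[List[int]],
                  frame_samples: int = FRAME_SAMPLES) -> Iterator[List[int]]:
    """Re-slice arbitrarily sized input chunks into fixed-size frames.

    Frames are yielded as soon as enough samples have arrived, so a
    downstream model can start work before the speaker has finished.
    """
    buffer: List[int] = []
    for chunk in chunks:
        buffer.extend(chunk)
        while len(buffer) >= frame_samples:
            yield buffer[:frame_samples]
            buffer = buffer[frame_samples:]
    if buffer:
        # Zero-pad the final partial frame so every frame has the same length.
        yield buffer + [0] * (frame_samples - len(buffer))


if __name__ == "__main__":
    # One second of silence delivered in uneven chunks, as a microphone might.
    mic_chunks = [[0] * 1000, [0] * 5000, [0] * 10_000]
    frames = list(stream_frames(mic_chunks))
    print(len(frames), "frames of", FRAME_SAMPLES, "samples")
```

Because stream_frames is a generator, the first frame is available after 1280 samples rather than after the full input, which is the property a low-latency system depends on.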

Quick Start & Requirements

  • Install/Run: Use python infer.py for CLI inference or python gradio_demo.py for a Gradio demo.
  • Prerequisites: Python and the Hugging Face libraries. Hardware requirements are not stated, though a GPU is implied for optimal performance.
  • Resources: Model weights are available on Hugging Face.
  • Links: Project Page, Hugging Face, Online Demo
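
Before running either script, a quick preflight check can save a failed launch. The sketch below is hypothetical and not part of the Voila repository: the package names torch and transformers are assumptions (the summary says only "Hugging Face libraries"), and the helper names are made up for illustration. No command-line flags are passed, because none are listed here.

```python
import importlib.util
import sys
from pathlib import Path
from typing import List, Sequence

# Assumed package names; the summary only says "Hugging Face libraries".
ASSUMED_PACKAGES = ("torch", "transformers")


def missing_packages(names: Sequence[str]) -> List[str]:
    """Return the subset of top-level package names that cannot be imported."""
    return [n for n in names if importlib.util.find_spec(n) is None]


def build_command(script: str, repo_dir: Path) -> List[str]:
    """Build an argv list that runs `script` with the current interpreter.

    No command-line flags are added, because the summary does not list any.
    """
    path = repo_dir / script
    if not path.is_file():
        raise FileNotFoundError(f"{script} not found in {repo_dir}")
    return [sys.executable, str(path)]


if __name__ == "__main__":
    missing = missing_packages(ASSUMED_PACKAGES)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        try:
            # Run from the root of a local clone so infer.py can be found.
            print("Command:", build_command("infer.py", Path.cwd()))
        except FileNotFoundError as err:
            print(err)
```

The returned list can be passed to subprocess.run once any arguments the scripts require have been appended; consult the repository README for those.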

Highlighted Details

  • Achieves response latency as low as 195 ms, faster than the average human response time.
  • Offers millions of pre-built and customizable voices with fast switching during conversations.
  • Unified model for ASR, TTS, and speech translation across six languages.
  • Achieves a Voila Benchmark score of 30.56, roughly 2.3 times SpeechGPT's 13.29 and 2.7 times Moshi's 11.45.
  • Reports a Word Error Rate (WER) of 4.8% on LibriSpeech test-clean without training on LibriSpeech data, which is competitive with state-of-the-art models.
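
For readers unfamiliar with the metric, WER is the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the evaluation script used for the figure above. The function name and the lowercase, whitespace-split normalization are assumptions; real evaluations usually apply more careful text normalization.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref: List[str] = reference.lower().split()
    hyp: List[str] = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    ref = "the quick brown fox jumps over the lazy dog"
    hyp = "the quick brown fox jumped over lazy dog"
    # One substitution and one deletion against nine reference words.
    print(f"WER: {word_error_rate(ref, hyp):.1%}")
```

Note that WER can exceed 100% when the hypothesis contains many inserted words, since the denominator counts only reference words.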

Maintenance & Community

  • Released inference code and model weights on April 28, 2025.
  • Key contributors include Yemin Shi, Yu Shu, Siwei Dong, Guangyi Liu, Jaward Sesay, Jingwen Li, and Zhiting Hu.
  • Citation available for academic use.

Licensing & Compatibility

  • License details are not explicitly provided in the README. Compatibility for commercial or closed-source use is not specified.

Limitations & Caveats

  • The "Voila-Autonomous" model is noted as a preview.
  • Specific hardware requirements for optimal performance are not detailed.
  • Licensing information is absent, which may impact commercial adoption.