Voila by maitrix-org

Voice-language foundation models for real-time human-AI interaction

created 4 months ago
429 stars

Project Summary

Voila is a family of large voice-language foundation models designed for real-time, natural human-AI interaction. It targets researchers and developers who want to move conversational AI past the usual limits on latency and vocal nuance. Voila processes audio end to end, enabling autonomous, expressive voice dialogue with latency as low as 195 ms.

How It Works

Voila combines an end-to-end model design with a hierarchical Transformer architecture that integrates voice and language modeling. This lets it process streaming audio in real time and respond with low latency. The architecture is built for autonomous, persona-driven interaction, and a single unified model handles audio tasks such as ASR, TTS, and speech translation.
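
To make "low-latency responses" concrete, the sketch below shows one common way to measure response latency for a streaming system: the time from sending input to receiving the first output chunk. It does not use Voila's API; fake_streaming_model and time_to_first_chunk_ms are hypothetical names, and the 50 ms delay is an arbitrary placeholder, not a Voila measurement.

```python
import time
from typing import Iterable, Iterator, Tuple


def fake_streaming_model(audio_chunks: Iterable[bytes], delay_s: float = 0.05) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming voice model (NOT Voila's API).

    Waits `delay_s` once to imitate processing time, then echoes the
    input chunks back one at a time.
    """
    time.sleep(delay_s)
    for chunk in audio_chunks:
        yield chunk


def time_to_first_chunk_ms(stream: Iterator[bytes]) -> Tuple[float, bytes]:
    """Return (latency in milliseconds, first chunk) for an output stream."""
    start = time.perf_counter()
    first = next(stream)  # blocks until the model emits its first chunk
    return (time.perf_counter() - start) * 1000.0, first


if __name__ == "__main__":
    chunks = [b"\x00" * 320 for _ in range(10)]  # placeholder audio frames
    latency_ms, _ = time_to_first_chunk_ms(fake_streaming_model(chunks))
    print(f"time to first output chunk: {latency_ms:.1f} ms")
```

With a real model, the generator would wrap whatever streaming interface the inference code exposes; the timing logic stays the same.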

Quick Start & Requirements

  • Install/Run: Use python infer.py for CLI inference or python gradio_demo.py for a Gradio demo.
  • Prerequisites: Python and Hugging Face libraries. The README does not state hardware requirements, though a GPU is likely needed for good performance.
  • Resources: Model weights are available on Hugging Face.
  • Links: Project Page, Hugging Face, Online Demo
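
The two entry points above can be wrapped in a small launcher, sketched here under stated assumptions. Only the script names infer.py and gradio_demo.py come from the summary; build_command, run, and the "cli"/"demo" mode names are hypothetical. The summary lists no command-line arguments, so none are passed, although the scripts may require some (check the README).

```python
import subprocess
import sys
from typing import List

# Script names are taken from the summary; the mode keys are illustrative.
ENTRY_POINTS = {"cli": "infer.py", "demo": "gradio_demo.py"}


def build_command(mode: str) -> List[str]:
    """Build the command for a mode, using the current Python interpreter."""
    if mode not in ENTRY_POINTS:
        raise ValueError(f"unknown mode {mode!r}; expected one of {sorted(ENTRY_POINTS)}")
    # No arguments are added: the summary does not document any.
    return [sys.executable, ENTRY_POINTS[mode]]


def run(mode: str, repo_dir: str = ".") -> int:
    """Run the chosen script from a checkout of the repository; return its exit code."""
    return subprocess.run(build_command(mode), cwd=repo_dir, check=False).returncode
```

Calling run("demo", repo_dir="path/to/checkout") would start the Gradio demo from a local clone, assuming the dependencies and model weights are already in place.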

Highlighted Details

  • Achieves response latency as low as 195 ms, which the project describes as faster than the average human response time.
  • Offers millions of pre-built and customizable voices with fast switching during conversations.
  • Unified model for ASR, TTS, and speech translation across six languages.
  • Scores 30.56 on the Voila Benchmark, more than double the scores of SpeechGPT (13.29) and Moshi (11.45).
  • Reports a Word Error Rate (WER) of 4.8% on LibriSpeech test-clean without training on LibriSpeech data, which the project describes as competitive with state-of-the-art models.
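
For readers unfamiliar with the WER figure above, here is a minimal sketch of the standard definition: the word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words. It is a generic illustration, not the project's evaluation script; word_error_rate is a hypothetical helper, and real evaluations usually normalize text (case, punctuation) first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word out of six reference words.
    print(f"{word_error_rate('the cat sat on the mat', 'the cat sat on mat'):.3f}")  # 0.167
```

Under this definition, a WER of 4.8% corresponds to roughly 48 word errors per 1,000 reference words.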

Maintenance & Community

  • Released inference code and model weights on April 28, 2025.
  • Key contributors include Yemin Shi, Yu Shu, Siwei Dong, Guangyi Liu, Jaward Sesay, Jingwen Li, and Zhiting Hu.
  • Citation available for academic use.

Licensing & Compatibility

  • License details are not explicitly provided in the README. Compatibility for commercial or closed-source use is not specified.

Limitations & Caveats

  • The "Voila-Autonomous" model is noted as a preview.
  • Specific hardware requirements for optimal performance are not detailed.
  • Licensing information is absent, which may impact commercial adoption.

Health Check

  • Last commit: 2 months ago
  • Responsiveness: Inactive
  • Pull requests (30 days): 0
  • Issues (30 days): 5
  • Star history: 428 stars in the last 90 days
