Voila by maitrix-org

Voice-language foundation models for real-time human-AI interaction

Created 11 months ago

486 stars

Top 63.3% on SourcePulse

Project Summary

Voila is a family of large voice-language foundation models designed for real-time, natural human-AI interaction. It targets researchers and developers seeking to advance conversational AI beyond traditional limitations of latency and vocal nuance. Voila processes audio end to end, enabling autonomous, expressive voice dialogues with latency under 200 ms.

How It Works

Voila combines an end-to-end model design with a hierarchical Transformer architecture. This approach integrates voice and language modeling in one system, allowing real-time streaming audio processing and low-latency responses. The architecture is optimized for autonomous, persona-driven interactions, and a single unified model handles audio tasks such as automatic speech recognition (ASR), text-to-speech (TTS), and speech translation.
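To make "low-latency streaming" concrete, here is a minimal sketch of how time-to-first-chunk could be measured for any streaming audio generator. It does not use Voila's actual API: `fake_stream` is a hypothetical stand-in for a model that yields audio chunks, and its delay and chunk size are made-up values for illustration only.

```python
import time
from typing import Iterable, Iterator, Tuple


def first_chunk_latency_ms(stream: Iterable[bytes]) -> Tuple[float, int]:
    """Return (milliseconds until the first chunk arrived, total chunks consumed)."""
    start = time.perf_counter()
    latency_ms = float("nan")  # stays NaN if the stream yields nothing
    count = 0
    for _chunk in stream:
        if count == 0:
            latency_ms = (time.perf_counter() - start) * 1000.0
        count += 1
    return latency_ms, count


def fake_stream(first_delay_s: float = 0.05, chunks: int = 3) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming model: waits, then yields silence."""
    time.sleep(first_delay_s)  # simulated time before the first audio is ready
    for _ in range(chunks):
        yield b"\x00" * 320  # arbitrary placeholder chunk of zero bytes


if __name__ == "__main__":
    latency, n = first_chunk_latency_ms(fake_stream())
    print(f"first chunk after {latency:.1f} ms, {n} chunks total")
```

Measuring the first chunk rather than the full response matches how response latency is usually experienced in a streaming dialogue, since playback can begin as soon as the first audio arrives.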

Quick Start & Requirements

Install/Run: Use python infer.py for CLI inference or python gradio_demo.py for a Gradio demo.
Prerequisites: Python and Hugging Face libraries. Hardware requirements are not stated, though a GPU is implied for practical performance.
Resources: Model weights are available on Hugging Face.
Links: Project Page, Hugging Face, Online Demo
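Before running the scripts above, it can help to confirm that the environment is ready. The following is a rough sketch, not part of the project: the package list is an assumption (the source only says "Hugging Face libraries"), and the GPU check is only meaningful if PyTorch happens to be installed.

```python
import importlib.util
import sys

# Assumed package list: the source only mentions "Hugging Face libraries",
# so these names are an illustrative guess, not the project's requirements.
ASSUMED_PACKAGES = ("torch", "transformers", "huggingface_hub")


def check_environment(packages=ASSUMED_PACKAGES):
    """Report which packages are importable and whether a CUDA GPU is visible."""
    report = {"python": sys.version.split()[0]}
    for name in packages:
        # find_spec returns None when a top-level package is not installed
        report[name] = importlib.util.find_spec(name) is not None
    report["cuda"] = False
    if report.get("torch"):
        try:
            import torch  # imported lazily so the check works without torch
            report["cuda"] = torch.cuda.is_available()
        except Exception:
            report["cuda"] = False  # a broken install counts as no GPU
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

If `cuda` is reported as False, inference may still run on CPU, but the latency figures quoted on this page should not be expected.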

Highlighted Details

Achieves response latency as low as 195 ms, faster than the average human response time.
Offers millions of pre-built and customizable voices with fast switching during conversations.
Unified model for ASR, TTS, and speech translation across six languages.
Achieves a Voila Benchmark score of 30.56, significantly outperforming SpeechGPT (13.29) and Moshi (11.45).
Reports a Word Error Rate (WER) of 4.8% on LibriSpeech test-clean without training on LibriSpeech data, competitive with state-of-the-art models.
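For readers unfamiliar with the WER figure above: it is the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words, that is, (substitutions + deletions + insertions) / N. The sketch below is a generic implementation of that definition, not the project's evaluation code. It assumes whitespace tokenization and applies no text normalization (casing, punctuation), which real benchmarks usually do.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    ref = "the cat sat on the mat"
    hyp = "the cat sit on mat"
    # One substitution (sat -> sit) and one deletion (the): 2 / 6
    print(f"WER = {word_error_rate(ref, hyp):.3f}")  # WER = 0.333
```

A WER of 4.8% therefore corresponds to roughly one word-level error for every 21 reference words.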

Maintenance & Community

Released inference code and model weights on April 28, 2025.
Key contributors include Yemin Shi, Yu Shu, Siwei Dong, Guangyi Liu, Jaward Sesay, Jingwen Li, and Zhiting Hu.
Citation available for academic use.

Licensing & Compatibility

License details are not explicitly provided in the README. Compatibility for commercial or closed-source use is not specified.

Limitations & Caveats

The "Voila-Autonomous" model is noted as a preview.
Specific hardware requirements for optimal performance are not detailed.
Licensing information is absent, which may impact commercial adoption.

Health Check

Last Commit: 9 months ago
Responsiveness: 1 day
Pull Requests (30d): 0
Issues (30d): 1

Star History

3 stars in the last 30 days
