vui by fluxions-ai

Conversational speech models for on-device use

Created 7 months ago

635 stars

Top 52.3% on SourcePulse

3 Experts Love This Project

Tobi Lutke

Cofounder of Shopify

Luis Capelo

Cofounder of Lightning AI

Akshat Bubna

Cofounder of Modal

Project Summary

Vui provides small, on-device conversational speech models for researchers and developers. It enables local execution of speech interaction, reducing reliance on cloud services and offering potential for real-time applications.

How It Works

Vui is a Llama-based transformer that predicts audio tokens. It uses Fluac, an audio tokenizer derived from Descript Audio Codec, which quantizes audio at 21.53 Hz (a 4x reduction from roughly 86 Hz). The lower token rate is intended to make on-device models efficient while still supporting contextual speech generation.
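The frame-rate figures above can be checked with a little arithmetic. The sketch below assumes a 44.1 kHz sample rate and a 512-sample hop for the base codec; neither value is stated in this summary, and they are chosen only because they reproduce the quoted rates. The names SAMPLE_RATE_HZ, BASE_HOP, frame_rate, and frames_for are illustrative, not part of Vui or Fluac.

```python
# Assumed values (not given in the summary): 44.1 kHz audio, 512-sample hop.
SAMPLE_RATE_HZ = 44_100
BASE_HOP = 512
EXTRA_DOWNSAMPLE = 4  # the "4x reduction" quoted in the summary


def frame_rate(sample_rate_hz: int, hop: int) -> float:
    """Token frames produced per second of audio for a given hop size."""
    return sample_rate_hz / hop


def frames_for(seconds: float, rate_hz: float) -> int:
    """Approximate number of token frames needed to cover `seconds` of audio."""
    return round(seconds * rate_hz)


base_rate = frame_rate(SAMPLE_RATE_HZ, BASE_HOP)
fluac_rate = frame_rate(SAMPLE_RATE_HZ, BASE_HOP * EXTRA_DOWNSAMPLE)

print(f"base codec: {base_rate:.2f} Hz, reduced: {fluac_rate:.2f} Hz")
print(f"30 s of audio: {frames_for(30, base_rate)} vs {frames_for(30, fluac_rate)} frames")
```

Under these assumptions, the 4x reduction means a transformer has to model roughly a quarter as many time steps for the same stretch of audio, which is the main reason a lower token rate helps a small on-device model.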

Quick Start & Requirements

Install: Run pip install -e . from the repository root (Linux/Windows, with uv).
Prerequisites: Accept the Hugging Face model terms for the VAD and segmentation models.
Demo: Run python demo.py to launch the Gradio interface.
Hardware: Developed on two NVIDIA 4090 GPUs.

Highlighted Details

Models include Vui.BASE (trained on 40k hours of audio), Vui.ABRAHAM (single speaker, context-aware), and Vui.COHOST (two speakers interacting).
Voice cloning is supported with the base model, though results are imperfect.
The project leverages Whisper, Audiocraft, and Descript Audio Codec.

Maintenance & Community

Primary developer: Harry Coultas Blum.
No explicit community links (Discord/Slack) or roadmap are provided in the README.

Licensing & Compatibility

The README does not explicitly state a license.
Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The model is known to hallucinate, and the README notes that its quality was constrained by the limited resources available to the developer. Voice Activity Detection (VAD) is used to remove silence, but it can slow down processing.

Health Check

Last Commit

2 months ago

Responsiveness

1 day

Star History

1 star in the last 30 days