vui by fluxions-ai

Conversational speech models for on-device use

created 1 month ago
626 stars

Top 53.7% on sourcepulse

Project Summary

Vui provides small, on-device conversational speech models for researchers and developers. It enables local execution of speech interaction, reducing reliance on cloud services and offering potential for real-time applications.

How It Works

Vui is a Llama-based transformer that predicts audio tokens. It uses Fluac, an audio tokenizer derived from Descript Audio Codec, which emits audio tokens at 21.53 Hz, a 4x reduction from roughly 86 Hz. Fewer tokens per second of audio means shorter sequences for the transformer to predict, which is what makes contextual speech generation feasible on-device.
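The token-rate figures above can be checked with a little arithmetic. The sketch below assumes a 44.1 kHz sample rate and a 512-sample hop for the base codec; neither value is stated in this summary, and they are chosen only because they reproduce the quoted 86 Hz and 21.53 Hz figures. The function names are illustrative and not part of Vui.

```python
# Assumed values (not stated in the summary): 44.1 kHz audio, 512-sample hop.
SAMPLE_RATE_HZ = 44_100
BASE_HOP = 512
REDUCTION = 4  # the 4x reduction quoted for Fluac


def frame_rate(sample_rate_hz: int, hop: int) -> float:
    """Token frames emitted per second of audio."""
    return sample_rate_hz / hop


def frames_for(seconds: float, rate_hz: float) -> int:
    """Approximate number of token frames for a clip of the given length."""
    return round(seconds * rate_hz)


base_rate = frame_rate(SAMPLE_RATE_HZ, BASE_HOP)
fluac_rate = base_rate / REDUCTION

print(f"base codec: {base_rate:.2f} Hz")    # 86.13 Hz
print(f"reduced:    {fluac_rate:.2f} Hz")   # 21.53 Hz
print(frames_for(10, base_rate), "vs", frames_for(10, fluac_rate), "frames per 10 s")
```

Under these assumptions a 10-second clip drops from about 861 frames to about 215, which illustrates why the lower rate matters for a small on-device model.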

Quick Start & Requirements

  • Install: pip install -e . (Linux/Windows with uv)
  • Prerequisites: Requires accepting Hugging Face model terms for VAD and segmentation.
  • Demo: Run python demo.py to launch a Gradio interface.
  • Hardware: Developed on two NVIDIA 4090 GPUs.

Highlighted Details

  • Models include Vui.BASE (40k hours audio), Vui.ABRAHAM (single speaker, context-aware), and Vui.COHOST (two speakers interacting).
  • Voice cloning is supported with the base model, though the results are imperfect.
  • The project leverages Whisper, Audiocraft, and Descript Audio Codec.

Maintenance & Community

  • Primary developer: Harry Coultas Blum.
  • No explicit community links (Discord/Slack) or roadmap are provided in the README.

Licensing & Compatibility

  • The README does not explicitly state a license.
  • Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The model is known to hallucinate, and the README attributes its performance limits to constrained resources. Voice Activity Detection (VAD) is used to remove silence, but it can slow down processing.
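To illustrate what silence removal involves and why it adds processing time, here is a minimal sketch using a simple energy threshold. This is a stand-in technique, not the project's method: Vui relies on model-based VAD obtained through Hugging Face, and the function name, frame length, and threshold below are all hypothetical.

```python
def trim_silence(samples, frame_len=4, threshold=0.01):
    """Drop frames whose mean absolute amplitude is below the threshold.

    Energy-threshold stand-in for a real VAD model; parameters are illustrative.
    """
    kept = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(abs(s) for s in frame) / len(frame)
        if energy >= threshold:
            kept.extend(frame)
    return kept


# Toy signal: silence, a burst of speech, silence, a quieter burst.
audio = [0.0] * 8 + [0.5, -0.4, 0.3, -0.2] + [0.0] * 4 + [0.1, 0.2, -0.1, 0.0]
trimmed = trim_silence(audio)
print(len(audio), "->", len(trimmed), "samples")
```

Even this trivial version takes an extra pass over every sample before generation can use the audio; a neural VAD does considerably more work per frame, which is consistent with the slowdown noted above.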

Health Check

  • Last commit: 3 days ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0

Star History

  • 627 stars in the last 90 days
