natural_voice_assistant by LAION-AI

Open-source AI voice assistant for natural, empathic conversations

created 1 year ago
490 stars

Top 63.8% on sourcepulse

Project Summary

BUD-E is an open-source, fully local AI voice assistant designed for natural, real-time conversations with emotional intelligence and long-term memory. It targets users seeking an advanced, private voice assistant that can handle multi-speaker interactions and interruptions, running on consumer hardware.

How It Works

BUD-E integrates NVIDIA's FastConformer streaming STT, Microsoft's Phi-2 LLM, and StyleTTS2 for TTS. It aims for low-latency responses by fine-tuning STT and TTS models with LLM context, and plans to implement speculative decoding and end-of-speech detection for further speed improvements. The system is designed to manage conversational context and potentially incorporate multi-modal inputs and tool use.
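The STT → LLM → TTS flow described above can be sketched in Python. This is a minimal illustration only: the class names and behavior are stand-ins, not BUD-E's actual API, and the real system streams audio and model outputs rather than passing complete strings.

```python
# Illustrative sketch of a BUD-E-style voice pipeline: streaming STT feeds
# a chat LLM, whose reply is rendered by TTS. All names are hypothetical.

class StreamingSTT:
    """Stand-in for a FastConformer-style streaming transcriber."""
    def transcribe(self, audio_chunks):
        # A real model would emit partial transcripts chunk by chunk.
        return " ".join(audio_chunks)

class ChatLLM:
    """Stand-in for a Phi-2-style chat model that keeps conversational context."""
    def __init__(self):
        self.history = []  # (role, text) pairs, the conversation memory
    def reply(self, user_text):
        self.history.append(("user", user_text))
        answer = f"You said: {user_text}"  # placeholder for real generation
        self.history.append(("assistant", answer))
        return answer

class TTS:
    """Stand-in for StyleTTS2; returns a tagged string instead of audio."""
    def speak(self, text):
        return f"<audio:{text}>"

def converse(stt, llm, tts, audio_chunks):
    """One conversational turn: transcribe, generate a reply, speak it."""
    text = stt.transcribe(audio_chunks)
    return tts.speak(llm.reply(text))
```

In the real system, latency comes from overlapping these stages (streaming STT while the user speaks, starting TTS before the LLM finishes), which is what the planned speculative decoding and end-of-speech detection target.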

Quick Start & Requirements

  • Install: Clone the repo with git clone --recurse-submodules, create a conda environment with Python 3.10.12, install espeak-ng and PyTorch, then run pip install -r requirements.txt.
  • Run: Execute python main.py.
  • Prerequisites: NVIDIA GPU (RTX 4090 demonstrated for 300-500ms latency), Python 3.10.12, espeak-ng, PyTorch. Ubuntu users may need portaudio19-dev.
  • Resources: Requires downloading pretrained models on first run.
  • Docs: Installation guide within the README.
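The steps above can be collected into a shell sketch. The repository URL and environment name are assumptions; consult the README's installation guide for the authoritative commands.

```shell
# Hypothetical install sketch; repo URL and env name are assumptions.
git clone --recurse-submodules https://github.com/LAION-AI/natural_voice_assistant.git
cd natural_voice_assistant
conda create -n bud-e python=3.10.12 -y
conda activate bud-e
# Ubuntu: espeak-ng plus PortAudio headers for the audio stack
sudo apt-get install -y espeak-ng portaudio19-dev
pip install torch  # pick the build matching your CUDA version
pip install -r requirements.txt
python main.py  # downloads pretrained models on first run
```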

Highlighted Details

  • Real-time conversational AI with empathy and emotional intelligence.
  • Handles multi-speaker conversations with interruptions and thinking pauses.
  • Operates fully locally on consumer hardware.
  • Demonstrated latency of 300-500ms on an NVIDIA RTX 4090.

Maintenance & Community

  • Collaboration between LAION, ELLIS Institute Tübingen, Collabora, and Tübingen AI Center.
  • Community contributions are invited via Discord or email (bud-e@laion.ai).
  • Roadmap includes significant planned improvements in latency, naturalness, memory, and functionality.

Licensing & Compatibility

  • License details are not explicitly stated in the README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The current version is a demo with ongoing development; many roadmap features are not yet implemented. Multi-speaker support is basic, and reliable speaker diarization is a planned improvement. WhisperSpeech TTS is noted as very slow on Windows due to torch.compile incompatibility.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 8 stars in the last 90 days

Explore Similar Projects

Starred by Thomas Wolf (Cofounder of Hugging Face), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 2 more.

ultravox by fixie-ai

  • Top 0.4% · 4k stars
  • Multimodal LLM for real-time voice interactions
  • created 1 year ago · updated 4 days ago

Starred by Jeff Hammerbacher (Cofounder of Cloudera), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 2 more.

MiniCPM-o by OpenBMB

  • Top 0.2% · 20k stars
  • MLLM for vision, speech, and multimodal live streaming on your phone
  • created 1 year ago · updated 1 month ago