smart-turn  by pipecat-ai

Turn detection model for conversational voice AI

created 5 months ago
840 stars

Top 43.3% on sourcepulse

Project Summary

smart-turn is an open-source, community-driven turn detection model that operates on audio. It is designed to improve on traditional Voice Activity Detection (VAD), which only detects whether speech is present, by also using linguistic and acoustic cues to judge whether a speaker has finished their turn. The goal is to let voice agents respond more naturally. It targets developers and researchers working on voice AI.

How It Works

The model uses Meta AI's Wav2Vec2-BERT, a 580M-parameter speech encoder, as its backbone, chosen because it can draw on both acoustic and linguistic information. A simple two-layer classification head is added on top for sequence classification. Building on a large pre-trained encoder speeds up development and leaves room for fine-tuning on specific datasets.
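
To make the "encoder plus two-layer head" idea concrete, here is a minimal, dependency-free sketch of what such a head might compute. It is not the project's implementation: the mean pooling, the ReLU activation, the sigmoid output, the layer sizes, and all function names are assumptions chosen for illustration, and random weights and made-up frame vectors stand in for the trained head and the real Wav2Vec2-BERT outputs.

```python
import math
import random


def mean_pool(frames):
    """Average per-frame embedding vectors into a single vector."""
    dim = len(frames[0])
    return [sum(frame[i] for frame in frames) / len(frames) for i in range(dim)]


def linear(x, weights, bias):
    """Dense layer; `weights` holds one row of coefficients per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_head(embed_dim, hidden_dim, seed=0):
    """Random placeholder weights; a real head would be trained."""
    rng = random.Random(seed)

    def matrix(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {
        "w1": matrix(hidden_dim, embed_dim), "b1": [0.0] * hidden_dim,
        "w2": matrix(1, hidden_dim), "b2": [0.0],
    }


def turn_complete_probability(frames, head):
    """Pool encoder frames, apply two dense layers, return a probability."""
    pooled = mean_pool(frames)                                # (embed_dim,)
    hidden = relu(linear(pooled, head["w1"], head["b1"]))     # layer 1
    logit = linear(hidden, head["w2"], head["b2"])[0]         # layer 2
    return sigmoid(logit)


if __name__ == "__main__":
    # Made-up stand-in for encoder output: 5 frames of 8-dimensional vectors.
    frames = [[0.1 * (i + 1)] * 8 for i in range(5)]
    head = init_head(embed_dim=8, hidden_dim=4)
    print(turn_complete_probability(frames, head))
```

In a setup like this, only the small head needs to be trained from scratch, which is one reason a pre-trained backbone shortens development.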

Quick Start & Requirements

  • Set up a Python 3.12 virtual environment, then install dependencies with pip install -r requirements.txt.
  • Requires PortAudio development libraries (e.g., portaudio19-dev on Ubuntu/Debian, brew install portaudio on macOS).
  • Initial startup takes ~30 seconds.
  • Example usage: python record_and_predict.py
  • Official Hugging Face page: pipecat-ai/smart-turn
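
Before installing, the prerequisites listed above can be checked with a short preflight script. The sketch below is not part of the repository; the function name and the result keys are hypothetical. It uses only the Python standard library. Note that ctypes.util.find_library looks for the PortAudio shared library, so a positive result does not guarantee that the development headers are installed.

```python
import ctypes.util
import sys


def check_environment():
    """Report whether the interpreter matches the Quick Start prerequisites."""
    return {
        # The Quick Start asks for Python 3.12.
        "python_3_12": sys.version_info[:2] == (3, 12),
        # Inside a venv, sys.prefix differs from sys.base_prefix.
        "in_virtualenv": sys.prefix != sys.base_prefix,
        # Looks for the PortAudio shared library on the system search path.
        "portaudio_found": ctypes.util.find_library("portaudio") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```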

Highlighted Details

  • Built on Wav2Vec2-BERT, which was pre-trained on 4.5M hours of unlabeled audio data.
  • Current inference speed: ~150ms on GPU, ~1500ms on CPU.
  • The training set contains ~8,000 samples, a mix of human and synthetic speech, and focuses primarily on pause filler words.
  • Community contribution to model development and data generation is encouraged.

Maintenance & Community

  • Contributors include Marcus, Eli, Mark, and Kwindla.
  • The project is part of the Pipecat ecosystem.
  • Data generation contribution guidelines are available.

Licensing & Compatibility

  • BSD 2-clause license, allowing for permissive use, forking, and contribution.
  • Suitable for commercial use and integration into closed-source applications.

Limitations & Caveats

The current model is a proof of concept. It supports English only, its inference is relatively slow (~150ms on GPU, ~1500ms on CPU), and its training data is limited, focusing primarily on pause filler words. The project expects performance to improve quickly as more diverse data is added.

Health Check

  • Last commit: 1 week ago
  • Responsiveness: 1 day
  • Pull requests (30d): 5
  • Issues (30d): 6
  • Star history: 154 stars in the last 90 days
