smart-turn by pipecat-ai

Turn detection model for conversational voice AI

Created 10 months ago

1,206 stars

Top 32.4% on SourcePulse

1 Expert Loves This Project

Luis Capelo

Cofounder of Lightning AI

Project Summary

This project provides an open-source, community-driven audio turn detection model designed to improve upon traditional Voice Activity Detection (VAD) methods in conversational AI. It aims to enable voice agents to respond more naturally by considering linguistic and acoustic cues, targeting developers and researchers in the voice AI space.
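To make the contrast with plain VAD concrete, here is a minimal sketch of how an agent might combine a silence timer with the turn model's output. The function name, the thresholds, and the fallback timeout are all hypothetical illustrations and are not taken from the project; a VAD-only agent would effectively use just the silence check.

```python
def should_end_turn(silence_ms, completion_prob,
                    min_silence_ms=200, prob_threshold=0.5,
                    max_silence_ms=3000):
    """Decide whether the agent may start responding.

    silence_ms:       how long the VAD has reported silence (hypothetical input)
    completion_prob:  turn model's estimate that the utterance is complete
    All default thresholds are illustrative assumptions, not project values.
    """
    if silence_ms < min_silence_ms:
        # Too little silence to evaluate: treat the user as still speaking.
        return False
    if silence_ms >= max_silence_ms:
        # Fallback: a very long pause ends the turn regardless of the model.
        return True
    # In between, defer to the turn model instead of the silence timer alone.
    return completion_prob >= prob_threshold


# A short pause after a filler word ("um...") should not end the turn,
# while the same pause after a complete sentence should.
print(should_end_turn(silence_ms=400, completion_prob=0.1))
print(should_end_turn(silence_ms=400, completion_prob=0.9))
```

The point of the design is that the same pause length can lead to different decisions depending on what was said, which a silence threshold alone cannot express.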

How It Works

The model uses Meta AI's Wav2Vec2-BERT, a 580M-parameter speech encoder, as its backbone because it captures both acoustic and linguistic information. A simple two-layer classification head on top of the encoder performs sequence classification. Building on a large pre-trained encoder shortens development and leaves room for fine-tuning on task-specific datasets.
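The shape of such a head can be sketched in plain Python. This is only an illustration under stated assumptions: the mean pooling, the ReLU activation, the single sigmoid output, the toy dimensions, and the random weights are all hypothetical, and the encoder output is replaced by random frame embeddings. It does not reproduce the project's actual code or weights.

```python
import math
import random


def mean_pool(frames):
    """Average a list of equal-length frame embeddings into one vector."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def linear(x, weights, bias):
    """Dense layer; `weights` holds one row of coefficients per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def classify_turn(frames, w1, b1, w2, b2):
    """Two-layer head: pool -> linear -> ReLU -> linear -> sigmoid."""
    pooled = mean_pool(frames)
    hidden = [max(0.0, h) for h in linear(pooled, w1, b1)]
    logit = linear(hidden, w2, b2)[0]
    return 1.0 / (1.0 + math.exp(-logit))


# Toy sizes and random values stand in for real encoder outputs and weights.
rng = random.Random(0)
EMBED_DIM, HIDDEN_DIM, NUM_FRAMES = 8, 4, 5
frames = [[rng.uniform(-1, 1) for _ in range(EMBED_DIM)]
          for _ in range(NUM_FRAMES)]
w1 = [[rng.uniform(-0.5, 0.5) for _ in range(EMBED_DIM)]
      for _ in range(HIDDEN_DIM)]
b1 = [0.0] * HIDDEN_DIM
w2 = [[rng.uniform(-0.5, 0.5) for _ in range(HIDDEN_DIM)]]
b2 = [0.0]

prob_complete = classify_turn(frames, w1, b1, w2, b2)
print(f"P(turn complete) = {prob_complete:.3f}")
```

In a real setup the frame embeddings would come from the pre-trained encoder and the head's weights would be learned during fine-tuning.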

Quick Start & Requirements

Set up a Python 3.12 virtual environment, then install dependencies with pip install -r requirements.txt.
Requires the PortAudio development libraries (e.g., the portaudio19-dev package on Ubuntu/Debian, or brew install portaudio on macOS).
Initial startup takes ~30 seconds.
Example usage: python record_and_predict.py
Official Hugging Face page: pipecat-ai/smart-turn

Highlighted Details

Built on Wav2Vec2-BERT, which was pre-trained on 4.5M hours of unlabeled audio data.
Current inference speed: ~150ms on GPU, ~1500ms on CPU.
Training data consists of ~8,000 samples (human and synthetic), primarily focused on filler words.
Community contribution to model development and data generation is encouraged.
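Latency figures like the ones above depend heavily on hardware, so it can be useful to measure them locally. The following is a minimal timing harness using only the standard library. The `fake_predict` function is a hypothetical stand-in for the real model call, and the 16 kHz sample rate is an assumption made for the toy input.

```python
import statistics
import time


def measure_latency_ms(predict, audio, warmup=2, runs=10):
    """Time `predict(audio)` and report latency statistics in milliseconds."""
    for _ in range(warmup):
        predict(audio)  # warm-up calls are not timed
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        predict(audio)
        timings.append((time.perf_counter() - start) * 1000.0)
    return {
        "runs": runs,
        "median_ms": statistics.median(timings),
        "max_ms": max(timings),
    }


def fake_predict(audio):
    """Hypothetical placeholder for the model: a trivial energy check."""
    return sum(abs(sample) for sample in audio) / len(audio) > 0.1


# One second of silence, assuming 16 kHz mono samples (an assumption).
audio = [0.0] * 16000
stats = measure_latency_ms(fake_predict, audio)
print(stats)
```

Replacing `fake_predict` with the actual inference call gives a rough local comparison against the quoted GPU and CPU numbers; the median is reported because single runs can be skewed by one-off stalls.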

Maintenance & Community

Contributors include Marcus, Eli, Mark, and Kwindla.
The project is part of the Pipecat ecosystem.
Data generation contribution guidelines are available.

Licensing & Compatibility

BSD 2-clause license, allowing for permissive use, forking, and contribution.
Suitable for commercial use and integration into closed-source applications.

Limitations & Caveats

The current model is a proof of concept: it is English-only, inference is relatively slow (~150ms on GPU, ~1500ms on CPU), and the training data is limited to ~8,000 samples that focus mainly on pause filler words. The project expects performance to improve quickly as more diverse data is added.

Health Check

Last Commit

4 days ago

Responsiveness

1 day

Star History

60 stars in the last 30 days