Turn detection model for conversational voice AI
This project provides an open-source, community-driven audio turn detection model designed to improve on traditional Voice Activity Detection (VAD) in conversational AI. By weighing linguistic and acoustic cues rather than silence alone, it aims to help voice agents respond more naturally, and it targets developers and researchers working in voice AI.
How It Works
The model uses Meta AI's Wav2Vec2-BERT backbone, a 580M-parameter speech encoder chosen for its ability to leverage both acoustic and linguistic information, with a simple two-layer classification head added on top for sequence classification. Building on a powerful pre-trained foundation shortens development time and leaves room for fine-tuning on specific datasets.
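As a rough illustration, the architecture described above could be assembled as in the sketch below. This is not the project's actual code: the class name, pooling strategy, and head dimensions are assumptions, while facebook/w2v-bert-2.0 is the public 580M-parameter Wav2Vec2-BERT checkpoint on Hugging Face.

```python
# Minimal sketch (assumptions, not the project's code): a pre-trained
# Wav2Vec2-BERT encoder with a small two-layer head that outputs the
# probability that the current speaker's turn is complete.
import torch
import torch.nn as nn
from transformers import Wav2Vec2BertModel

class TurnDetector(nn.Module):  # hypothetical class name
    def __init__(self, backbone="facebook/w2v-bert-2.0", hidden=256):
        super().__init__()
        self.encoder = Wav2Vec2BertModel.from_pretrained(backbone)
        dim = self.encoder.config.hidden_size
        # Two-layer classification head on top of the pooled encoder output.
        self.head = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, input_features, attention_mask=None):
        out = self.encoder(input_features=input_features, attention_mask=attention_mask)
        pooled = out.last_hidden_state.mean(dim=1)  # simple mean pooling over time (an assumption)
        return torch.sigmoid(self.head(pooled)).squeeze(-1)  # P(turn is complete)
```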
Quick Start & Requirements
Set up a Python 3.12 virtual environment, then install the dependencies with pip install -r requirements.txt. The microphone demo also requires PortAudio (install the portaudio19-dev package on Ubuntu/Debian, or run brew install portaudio on macOS). Then run python record_and_predict.py to record audio and see turn predictions locally.
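For offline experimentation without a microphone, inference on a saved recording could look roughly like the sketch below. The checkpoint name is a placeholder, and the use of Wav2Vec2BertForSequenceClassification along with the label convention are assumptions; record_and_predict.py shows the project's actual flow.

```python
# Hedged sketch of offline inference on a 16 kHz mono WAV file.
# The checkpoint name and label meaning are placeholders/assumptions.
import torch
import soundfile as sf
from transformers import AutoFeatureExtractor, Wav2Vec2BertForSequenceClassification

CHECKPOINT = "your-org/your-turn-detection-checkpoint"  # hypothetical

extractor = AutoFeatureExtractor.from_pretrained(CHECKPOINT)
model = Wav2Vec2BertForSequenceClassification.from_pretrained(CHECKPOINT)
model.eval()

audio, sample_rate = sf.read("utterance.wav")  # expects 16 kHz mono audio
inputs = extractor(audio, sampling_rate=sample_rate, return_tensors="pt")

with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1)

print("P(turn complete):", probs[0, 1].item())  # assumes label index 1 = "complete"
```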
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The current proof-of-concept model is English-only, has relatively slow inference, and is trained on limited data that focuses primarily on pause filler words. Performance is expected to improve rapidly as more diverse training data is added.