ten-vad by TEN-framework

Low-latency voice activity detection for real-time AI

Created 6 months ago
1,549 stars

Project Summary

TEN VAD is a low-latency voice activity detector designed for real-time conversational AI. It targets developers building voice-enabled applications, and the project reports better precision-recall and lower computational cost than common alternatives such as WebRTC VAD and Silero VAD.

How It Works

TEN VAD uses a proprietary architecture optimized for temporal efficiency, so it detects speech-to-non-speech transitions quickly. Fast end-of-speech detection reduces end-to-end latency in conversational AI systems. The detector also handles short silences between speech segments, which the project describes as a common failure point for other VADs.
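The trade-off between fast end-of-speech detection and tolerance for short silences can be illustrated with a small post-processing sketch. This is not TEN VAD's internal method; it is a generic gap-bridging step applied to per-frame speech probabilities. The function name, the 16 ms frame duration, and the `max_gap_frames` parameter are assumptions made for illustration.

```python
from typing import List, Tuple


def probs_to_segments(
    probs: List[float],
    threshold: float = 0.5,
    frame_ms: float = 16.0,   # assumed frame duration, not taken from the project
    max_gap_frames: int = 3,  # hypothetical tolerance for short silences
) -> List[Tuple[float, float]]:
    """Group per-frame speech probabilities into (start_ms, end_ms) segments.

    Runs of at most `max_gap_frames` non-speech frames inside speech are
    bridged, so a brief pause does not split one utterance in two.
    """
    segments = []
    start = None        # index of the first frame of the open segment
    last_speech = None  # index of the most recent speech frame
    for i, p in enumerate(probs):
        if p >= threshold:
            if start is None:
                start = i
            last_speech = i
        elif start is not None and i - last_speech > max_gap_frames:
            # The silence has outlasted the tolerance: close the segment
            # at the end of the last speech frame.
            segments.append((start * frame_ms, (last_speech + 1) * frame_ms))
            start = None
    if start is not None:
        segments.append((start * frame_ms, (last_speech + 1) * frame_ms))
    return segments
```

A larger `max_gap_frames` bridges longer pauses but delays the moment the end of speech can be declared, which is why a detector whose raw probabilities drop quickly and reliably at the end of speech needs less of this smoothing.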

Quick Start & Requirements

  • Installation: git clone https://github.com/TEN-framework/ten-vad.git
  • Python usage: run pip install -r requirements.txt to get the dependencies for the examples and plotting scripts, or run pip install -U --force-reinstall -v git+https://github.com/TEN-framework/ten-vad.git to install the package for direct use.
  • Dependencies: Python (verified on 3.8.19 and 3.10.14), numpy, scipy, scikit-learn, matplotlib, torchaudio. ONNX usage requires onnxruntime >= 1.17.1. C/C++ usage requires CMake and a platform toolchain (Clang, Visual Studio, or Xcode).
  • Platforms: Linux, Windows, macOS, Android, iOS, Web (WASM/JS).
  • Resources: Setup time varies by platform; the core library is small (e.g., 306 KB on Linux x64).
  • Demo: Hugging Face Space (see the repository README for the link); demo asset: https://github.com/user-attachments/assets/725a8318-d679-4b17-b9e4-e3dce999b298
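Once the package is installed, a frame-based detector is typically driven by slicing 16 kHz audio into fixed-size hops and passing each hop to the detector. The sketch below shows that loop without depending on the installed binding: `process_frame` is a stand-in for whatever per-frame call the binding exposes, `energy_stub` is a hypothetical placeholder detector, and the 256-sample hop is an assumed value that is not stated in this summary.

```python
from typing import Callable, Iterator, List, Sequence, Tuple

SAMPLE_RATE = 16_000  # Hz, the input rate the detector expects
HOP_SIZE = 256        # samples per frame (16 ms at 16 kHz); assumed value


def iter_frames(samples: Sequence[int], hop_size: int = HOP_SIZE) -> Iterator[Sequence[int]]:
    """Yield consecutive non-overlapping frames; a trailing partial frame is dropped."""
    for start in range(0, len(samples) - hop_size + 1, hop_size):
        yield samples[start:start + hop_size]


def run_vad(
    samples: Sequence[int],
    process_frame: Callable[[Sequence[int]], Tuple[float, int]],
    hop_size: int = HOP_SIZE,
) -> List[Tuple[float, int]]:
    """Apply a per-frame detector and collect (probability, flag) pairs."""
    return [process_frame(frame) for frame in iter_frames(samples, hop_size)]


def energy_stub(frame: Sequence[int], threshold: float = 0.5) -> Tuple[float, int]:
    """Hypothetical stand-in detector: scaled mean absolute amplitude of int16 samples."""
    prob = min(1.0, sum(abs(s) for s in frame) / (len(frame) * 8000.0))
    return prob, int(prob >= threshold)
```

To use a real detector, replace `energy_stub` with a thin wrapper around the installed binding's per-frame call; the framing logic stays the same.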

Highlighted Details

  • Reports better precision-recall than WebRTC VAD and Silero VAD on benchmark datasets.
  • Detects speech-to-non-speech transitions with significantly lower latency than Silero VAD.
  • Has substantially lower computational complexity and a smaller library size than Silero VAD across multiple platforms.
  • Provides cross-platform C compatibility and Python, JS (WASM), Android, and iOS bindings.
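For readers who want to reproduce a precision-recall comparison on their own data, the sketch below computes frame-level precision and recall at a chosen threshold. It is a generic evaluation helper, not the project's benchmark code. The function name and the convention of returning 1.0 when a denominator is empty are assumptions.

```python
from typing import Sequence, Tuple


def precision_recall(
    probs: Sequence[float],
    labels: Sequence[int],
    threshold: float = 0.5,
) -> Tuple[float, float]:
    """Frame-level precision and recall for speech (label 1) at one threshold."""
    tp = fp = fn = 0
    for p, y in zip(probs, labels):
        predicted_speech = p >= threshold
        if predicted_speech and y:
            tp += 1          # speech correctly detected
        elif predicted_speech and not y:
            fp += 1          # non-speech flagged as speech
        elif not predicted_speech and y:
            fn += 1          # speech missed
    # Assumed convention: an empty denominator yields 1.0.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


# Toy values for illustration only; not benchmark data.
toy_probs = [0.9, 0.6, 0.4, 0.2, 0.7]
toy_labels = [1, 0, 1, 0, 1]
curve = [(t, precision_recall(toy_probs, toy_labels, t)) for t in (0.3, 0.5, 0.65)]
```

Sweeping the threshold this way traces out a precision-recall curve, which is also a practical way to pick a threshold other than the default for a specific application.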

Maintenance & Community

  • Actively developed; recent updates include integration into k2-fsa/sherpa-onnx and the release of ONNX models.
  • Community channels: Discord, X, LinkedIn, Hugging Face.

Licensing & Compatibility

  • Licensed under Apache 2.0.
  • Includes code derived from LPCNet, which is BSD-2-Clause and BSD-3-Clause licensed (details in NOTICES file).

Limitations & Caveats

  • Expects 16 kHz audio; inputs at other sampling rates must be resampled first.
  • The default threshold of 0.5 may require tuning for specific applications.
  • iOS usage requires manual framework embedding and device signature configuration in Xcode.
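The resampling requirement above can be met in several ways. The following is a minimal, dependency-free sketch using linear interpolation; the function name is hypothetical. Linear interpolation applies no anti-aliasing filter when downsampling, so a polyphase resampler such as `scipy.signal.resample_poly` (scipy is already among the listed dependencies) is likely the better choice for real audio.

```python
from typing import List, Sequence


def resample_linear(
    samples: Sequence[float],
    src_rate: int,
    dst_rate: int = 16_000,
) -> List[float]:
    """Resample a mono signal to dst_rate by linear interpolation (sketch only)."""
    if src_rate == dst_rate or not samples:
        return list(samples)
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate            # fractional index into the input
        left = int(pos)
        right = min(left + 1, len(samples) - 1)  # clamp at the final sample
        frac = pos - left
        out.append(samples[left] * (1 - frac) + samples[right] * frac)
    return out
```

For example, one second of 48 kHz audio (48,000 samples) becomes 16,000 samples, which can then be split into frames and passed to the detector.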

Health Check

  • Last commit: 2 weeks ago
  • Responsiveness: Inactive
  • Pull requests (30d): 2
  • Issues (30d): 4
  • Star history: 78 stars in the last 30 days