api4sensevoice  by 0x5446

API and websocket server for real-time streaming voice applications

Created 1 year ago
508 stars

Top 61.4% on SourcePulse

Project Summary

This project provides an API and WebSocket server for real-time speech processing, offering features like Voice Activity Detection (VAD), streaming transcription, and speaker verification. It targets developers building voice-enabled applications that require low-latency audio analysis and speaker identification.

How It Works

The server is built on the SenseVoice framework and uses VAD to segment incoming audio before passing it to real-time streaming recognition. Speaker verification compares incoming audio against pre-registered voice samples. Recent updates accumulate audio before verifying to improve accuracy, and add log-probability output as a measure of recognition confidence.
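
The idea of accumulating audio before verification can be illustrated with a small buffer. This is a minimal sketch and not the project's implementation: the class name `AudioAccumulator`, the `min_seconds` threshold, and the assumption that streamed audio is 16 kHz 16-bit mono PCM are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field

SAMPLE_RATE = 16_000   # Hz (assumed stream format, mirrors the enrolled-sample format)
BYTES_PER_SAMPLE = 2   # 16-bit PCM (assumed)


@dataclass
class AudioAccumulator:
    """Buffers PCM chunks until enough audio exists to attempt verification."""

    min_seconds: float = 1.0  # hypothetical threshold, not taken from the README
    _buffer: bytearray = field(default_factory=bytearray)

    def add(self, chunk: bytes) -> None:
        """Append one chunk of raw PCM bytes."""
        self._buffer.extend(chunk)

    @property
    def seconds(self) -> float:
        """Duration of the buffered audio in seconds."""
        return len(self._buffer) / (SAMPLE_RATE * BYTES_PER_SAMPLE)

    def ready(self) -> bool:
        """True once the buffer holds at least min_seconds of audio."""
        return self.seconds >= self.min_seconds

    def take(self) -> bytes:
        """Return the buffered audio and reset the buffer."""
        data = bytes(self._buffer)
        self._buffer.clear()
        return data
```

A caller would feed each VAD-approved chunk to `add()` and only run the (comparatively expensive) speaker comparison once `ready()` returns true, which trades a little latency for a longer and more reliable sample.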

Quick Start & Requirements

  • Install dependencies using Conda and pip:
    conda create -n api4sensevoice python=3.10
    conda activate api4sensevoice
    conda install -c conda-forge ffmpeg
    pip install -r requirements.txt
    
  • Run the API server: python server.py --port <port_number>
  • Run the WebSocket server: python server_wss.py --port <port_number>
  • Speaker verification requires WAV audio files (16kHz, mono, 16-bit) placed in a speaker directory.
  • Official documentation and client testing page are available via links in the README.
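
Because enrolled speaker samples must be 16 kHz, mono, 16-bit WAV files, it can help to check them before placing them in the speaker directory. The following is a small sketch using Python's standard `wave` module; the function name `check_speaker_wav` is hypothetical and is not part of the project.

```python
import wave


def check_speaker_wav(path: str) -> list[str]:
    """Return a list of format problems; an empty list means the file matches."""
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != 16_000:
            problems.append(f"sample rate is {wav.getframerate()} Hz, expected 16000")
        if wav.getnchannels() != 1:
            problems.append(f"{wav.getnchannels()} channels, expected mono")
        if wav.getsampwidth() != 2:
            problems.append(f"{wav.getsampwidth() * 8}-bit samples, expected 16-bit")
    return problems
```

Files that fail the check can usually be converted with the ffmpeg install from the setup steps, for example by resampling to 16 kHz and downmixing to one channel.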

Highlighted Details

  • Supports both REST API for single audio file transcription and WebSocket for real-time streaming.
  • Speaker verification can be enabled via a query parameter (sv=1) on the WebSocket endpoint.
  • Recent updates include optimized speaker verification and log-probability output for recognition confidence.
  • A roadmap indicates future plans for latency optimization.
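
As an illustration of the `sv=1` query parameter, the sketch below builds a WebSocket URL with speaker verification switched on or off. Only the `sv=1` parameter comes from the summary above; the function name, the `/ws/transcribe` path, and the port used in the example are placeholders, so check the README for the real endpoint.

```python
from urllib.parse import urlencode, urlunsplit


def transcribe_ws_url(
    host: str,
    port: int,
    path: str = "/ws/transcribe",   # placeholder path, an assumption
    speaker_verification: bool = False,
    secure: bool = False,
) -> str:
    """Build the WebSocket URL, adding sv=1 when speaker verification is wanted."""
    scheme = "wss" if secure else "ws"
    query = urlencode({"sv": 1}) if speaker_verification else ""
    return urlunsplit((scheme, f"{host}:{port}", path, query, ""))


# Example with a placeholder host and port:
print(transcribe_ws_url("localhost", 27000, speaker_verification=True))
# prints: ws://localhost:27000/ws/transcribe?sv=1
```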

Maintenance & Community

The project welcomes contributions and provides channels for bug reporting and feature requests. Specific community links (Discord/Slack) or social handles are not explicitly mentioned in the README.

Licensing & Compatibility

This project is licensed under the MIT License, permitting commercial use and integration with closed-source applications.

Limitations & Caveats

The project is described as under development, with latency optimization listed only as a future enhancement. The README does not specify hardware requirements; the only system dependency it names is ffmpeg.

Health Check

  • Last commit: 11 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 1
  • Issues (30d): 1
  • Star history: 21 stars in the last 30 days
