api4sensevoice by 0x5446

API and websocket server for real-time streaming voice applications

created 11 months ago
479 stars

Top 64.7% on sourcepulse

Project Summary

This project provides an API and WebSocket server for real-time speech processing, offering features like Voice Activity Detection (VAD), streaming transcription, and speaker verification. It targets developers building voice-enabled applications that require low-latency audio analysis and speaker identification.

How It Works

The server builds on the SenseVoice framework, using VAD to segment incoming audio so that only speech is passed to real-time streaming recognition. Speaker verification compares incoming audio against pre-registered voice samples. Recent optimizations accumulate audio before verifying to improve accuracy, and recognition results now include log-probabilities as a confidence measure.
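The comparison step can be pictured with a small sketch. It assumes each utterance and each registered sample has already been reduced to a fixed-length embedding vector, and that similarity is measured with cosine similarity against a threshold. None of this is confirmed by the README: `cosine_similarity`, `best_speaker`, and the `threshold` value are hypothetical names and numbers chosen for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector carries no direction to compare
    return dot / (norm_a * norm_b)


def best_speaker(embedding, enrolled, threshold=0.5):
    """Return (name, score) for the closest enrolled speaker.

    `enrolled` maps a speaker name to a reference embedding (assumed format).
    The name is None when no score reaches `threshold` (an illustrative value).
    """
    best_name, best_score = None, -1.0
    for name, reference in enrolled.items():
        score = cosine_similarity(embedding, reference)
        if score > best_score:
            best_name, best_score = name, score
    if best_score < threshold:
        return None, best_score
    return best_name, best_score
```

Accumulating more audio before computing the incoming embedding would, under these assumptions, give a more stable vector and therefore a more reliable score.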

Quick Start & Requirements

  • Install dependencies using Conda and pip:
    conda create -n api4sensevoice python=3.10
    conda activate api4sensevoice
    conda install -c conda-forge ffmpeg
    pip install -r requirements.txt
    
  • Run the API server: python server.py --port <port_number>
  • Run the WebSocket server: python server_wss.py --port <port_number>
  • Speaker verification requires WAV audio files (16kHz, mono, 16-bit) placed in a speaker directory.
  • Official documentation and a client testing page are available via links in the README.
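Because speaker samples must be 16kHz, mono, 16-bit WAV files, it can help to validate them before placing them in the speaker directory. The following is a minimal sketch using only Python's standard `wave` module; `check_speaker_wav` is a hypothetical helper, not part of the project.

```python
import wave


def check_speaker_wav(path):
    """Return a list of problems; an empty list means 16 kHz, mono, 16-bit."""
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != 16000:
            problems.append(f"sample rate is {wav.getframerate()} Hz, expected 16000")
        if wav.getnchannels() != 1:
            problems.append(f"{wav.getnchannels()} channels, expected 1 (mono)")
        if wav.getsampwidth() != 2:  # sample width is reported in bytes
            problems.append(f"{wav.getsampwidth() * 8}-bit samples, expected 16-bit")
    return problems
```

A file that fails the check could be converted with the ffmpeg installed above, for example `ffmpeg -i input.wav -ar 16000 -ac 1 -c:a pcm_s16le output.wav`.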

Highlighted Details

  • Supports both a REST API for transcribing a single audio file and a WebSocket endpoint for real-time streaming.
  • Speaker verification can be enabled via a query parameter (sv=1) on the WebSocket endpoint.
  • Recent updates include optimized speaker verification and log-probability output for recognition confidence.
  • A roadmap indicates future plans for latency optimization.
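As an illustration of the `sv=1` query parameter, the sketch below builds a WebSocket URL with speaker verification switched on or off. Only the `sv=1` parameter comes from the summary above. The host, port, and endpoint path are placeholders that must be replaced with the values documented in the README, and `build_ws_url` is a hypothetical helper.

```python
from urllib.parse import urlencode, urlunsplit


def build_ws_url(host, port, path, speaker_verification=False, secure=False):
    """Build a WebSocket URL, appending sv=1 when speaker verification is wanted.

    `path` is the server's WebSocket endpoint path; take it from the README.
    """
    scheme = "wss" if secure else "ws"
    query = urlencode({"sv": 1}) if speaker_verification else ""
    return urlunsplit((scheme, f"{host}:{port}", path, query, ""))


# "/example-path" and port 27000 are placeholders, not the project's real values.
url = build_ws_url("localhost", 27000, "/example-path", speaker_verification=True)
print(url)  # ws://localhost:27000/example-path?sv=1
```

Keeping the flag in the URL means a client can opt in per connection without changing the server's startup options.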

Maintenance & Community

The project welcomes contributions and provides channels for bug reporting and feature requests. Specific community links (Discord/Slack) or social handles are not explicitly mentioned in the README.

Licensing & Compatibility

This project is licensed under the MIT License, permitting commercial use and integration with closed-source applications.

Limitations & Caveats

The README describes the project as still under development, with latency optimization listed as a future enhancement rather than a delivered feature. It does not specify hardware requirements beyond the need for ffmpeg.

Health Check

  • Last commit: 9 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 65 stars in the last 90 days
