T-one  by voicekit-team

Streaming ASR pipeline for Russian and English

Created 9 months ago
256 stars

Top 98.5% on SourcePulse

Summary

T-one is a high-performance streaming Automatic Speech Recognition (ASR) pipeline, built on a CTC acoustic model and engineered for Russian-language telephony audio. It is a ready-to-use solution for real-time transcription, aimed at developers and researchers who need low-latency, high-throughput speech-to-text.

How It Works

The pipeline uses a Conformer-based acoustic model that processes audio in 300 ms chunks and carries hidden states from one chunk to the next, so acoustic context is preserved across chunk boundaries. A log-probability splitter then classifies frames as speech or silence to find phrase boundaries and outputs phrases with timestamps. Each phrase is decoded with either greedy decoding or a KenLM-based CTC beam search decoder, which keeps the decoding stage modular.
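The splitting and greedy-decoding steps described above can be sketched roughly as follows. This is an illustration only, not T-one's code: the function names, the blank index, the frame duration, the silence threshold, and the use of the blank log-probability as the silence signal are all assumptions made for the example.

```python
import math
from typing import List, Sequence, Tuple

BLANK = 0         # assumed index of the CTC blank token
FRAME_SEC = 0.03  # hypothetical duration of one output frame, not from the README

Frames = Sequence[Sequence[float]]  # per-frame log-probabilities over the vocabulary


def greedy_ctc_decode(log_probs: Frames, vocab: Sequence[str]) -> str:
    """Take the best token per frame, collapse repeats, drop blanks."""
    out: List[str] = []
    prev = BLANK
    for frame in log_probs:
        best = max(range(len(frame)), key=frame.__getitem__)
        if best != BLANK and best != prev:
            out.append(vocab[best])
        prev = best
    return "".join(out)


def split_phrases(
    log_probs: Frames,
    silence_threshold: float = math.log(0.5),  # assumed threshold
    min_silence_frames: int = 3,               # assumed minimum pause length
) -> List[Tuple[int, int]]:
    """Return (start, end) frame spans of speech, end exclusive.

    A frame counts as silence when log P(blank) >= silence_threshold.
    A phrase is closed once min_silence_frames silent frames occur in a row,
    so short blank runs between characters do not split a phrase.
    """
    spans: List[Tuple[int, int]] = []
    start = None
    silent_run = 0
    for i, frame in enumerate(log_probs):
        if frame[BLANK] < silence_threshold:   # speech frame
            if start is None:
                start = i
            silent_run = 0
        elif start is not None:                # silence inside an open phrase
            silent_run += 1
            if silent_run >= min_silence_frames:
                spans.append((start, i - silent_run + 1))
                start, silent_run = None, 0
    if start is not None:                      # phrase still open at the end
        spans.append((start, len(log_probs) - silent_run))
    return spans


def transcribe(log_probs: Frames, vocab: Sequence[str]) -> List[Tuple[float, float, str]]:
    """Split into phrases, decode each one, and attach timestamps in seconds."""
    return [
        (s * FRAME_SEC, e * FRAME_SEC, greedy_ctc_decode(log_probs[s:e], vocab))
        for s, e in split_phrases(log_probs)
    ]
```

Replacing `greedy_ctc_decode` with a beam search decoder that consults a language model would leave `split_phrases` untouched, which is the kind of modularity the description refers to.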

Quick Start & Requirements

A pre-built Docker image serves a web-based demo at http://localhost:8080. Local development requires Python 3.9+ and Poetry 2.1+. To install, clone the repository, set up a virtual environment, and then either run make install followed by make up_dev, or run poetry install -E demo and start the web service with uvicorn. Linux or macOS is recommended; Windows users should use WSL because of the KenLM dependency. At least 4 CPU cores and 8 GB of RAM are advised for smooth demo performance.

Highlighted Details

  • Achieves competitive Word Error Rates (WER) on telephony datasets, often outperforming larger models like Whisper large-v3 in specific categories.
  • Demonstrates high throughput via TensorRT optimization, reaching up to 57,344 Requests Per Second (RPS) on an NVIDIA H100 GPU.
  • Specialized for Russian language and telephony use cases, offering optimized performance for this domain.
  • Processes audio in small, 300 ms chunks for low-latency streaming recognition.
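To make the 300 ms chunking concrete, here is a minimal sketch of how a client might cut an audio stream into fixed-size chunks and thread a state object through successive model calls. The 8 kHz sample rate is an assumption (a common telephony rate), and `iter_chunks`, `stream`, and the `process_chunk` callable are hypothetical names, not part of T-one's API.

```python
from typing import Any, Callable, Iterator, List, Sequence, Tuple

SAMPLE_RATE = 8000  # assumed: 8 kHz is common for telephony audio
CHUNK_MS = 300      # chunk length stated in the summary
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_MS // 1000  # 2400 samples per chunk


def iter_chunks(
    samples: Sequence[int], chunk_samples: int = CHUNK_SAMPLES
) -> Iterator[List[int]]:
    """Yield fixed-size chunks, zero-padding the final one if it is short."""
    for start in range(0, len(samples), chunk_samples):
        chunk = list(samples[start:start + chunk_samples])
        chunk.extend([0] * (chunk_samples - len(chunk)))
        yield chunk


def stream(
    samples: Sequence[int],
    process_chunk: Callable[[List[int], Any], Tuple[Any, Any]],
) -> List[Any]:
    """Feed chunks to a model step, passing its state from call to call.

    process_chunk is a hypothetical callable: (chunk, state) -> (output, new_state).
    The state stands in for the hidden states that carry acoustic context
    across chunk boundaries; it is None for the first chunk.
    """
    state = None
    outputs = []
    for chunk in iter_chunks(samples):
        output, state = process_chunk(chunk, state)
        outputs.append(output)
    return outputs
```

Under these assumptions, each call sees only 2400 new samples, so results for a chunk can be produced without waiting for the rest of the recording.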

Maintenance & Community

The provided README does not detail specific contributors, community channels (e.g., Discord, Slack), sponsorships, or a public roadmap.

Licensing & Compatibility

The project is released under the Apache 2.0 License, which permits commercial use and integration into closed-source applications.

Limitations & Caveats

The KenLM dependency lacks official Windows support, so Windows users need WSL or a containerized environment. This complicates setup and can introduce dependency issues.

Health Check

  • Last commit: 3 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 1
  • Star history: 11 stars in the last 30 days
