T-one by voicekit-team

Streaming ASR pipeline for Russian and English

Created 1 year ago

270 stars

Top 95.1% on SourcePulse

Project Summary

T-one is a high-performance, streaming CTC-based Automatic Speech Recognition (ASR) pipeline specifically engineered for the Russian language and the telephony domain. It offers a ready-to-use solution for real-time transcription, benefiting developers and researchers requiring low-latency, high-throughput speech-to-text capabilities.

How It Works

The pipeline uses a Conformer-based acoustic model that processes audio in 300 ms chunks and carries hidden states across chunk boundaries to preserve acoustic context. A novel log-probability splitter detects speech and silence frames to find phrase boundaries and outputs phrases with timestamps. Transcription is finalized with either greedy decoding or a KenLM-based CTC beam search decoder, so the decoding stage can be chosen independently of the acoustic model.
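The greedy decoding option mentioned above can be illustrated with a minimal sketch of standard CTC greedy decoding: take the most likely token per frame, collapse consecutive repeats, then drop blanks. This is not T-one's own code; the function name, the blank index of 0, and the toy vocabulary are assumptions made for illustration.

```python
from itertools import groupby
from typing import List, Sequence

BLANK = 0  # assumption: the CTC blank token sits at index 0


def ctc_greedy_decode(
    log_probs: Sequence[Sequence[float]],
    vocab: List[str],
    blank: int = BLANK,
) -> str:
    """Decode per-frame log-probabilities with standard CTC greedy rules.

    log_probs: one row per frame, one log-probability per vocabulary entry.
    vocab: hypothetical token list; vocab[blank] is never emitted.
    """
    # 1. Pick the highest-scoring token index in every frame.
    best = [max(range(len(frame)), key=frame.__getitem__) for frame in log_probs]
    # 2. Collapse runs of the same index into a single occurrence.
    collapsed = [index for index, _ in groupby(best)]
    # 3. Remove blanks and map the remaining indices to text.
    return "".join(vocab[index] for index in collapsed if index != blank)
```

A blank between two identical tokens keeps them separate, which is how CTC represents doubled letters. A beam search decoder with a language model would replace step 1 with a search over several candidate paths.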

Quick Start & Requirements

A pre-built Docker image serves a web-based demo at http://localhost:8080. Local development requires Python 3.9+ and Poetry 2.1+. To install, clone the repository and set up a virtual environment, then either run make install followed by make up_dev, or run poetry install -E demo and start the web service with uvicorn. Linux or macOS is recommended; Windows users should use WSL because of the KenLM dependency. At least 4 CPU cores and 8 GB of RAM are advised for smooth demo performance.

Highlighted Details

Achieves competitive Word Error Rates (WER) on telephony datasets, often outperforming larger models like Whisper large-v3 in specific categories.
Demonstrates high throughput via TensorRT optimization, reaching up to 57,344 Requests Per Second (RPS) on an NVIDIA H100 GPU.
Specialized for Russian language and telephony use cases, offering optimized performance for this domain.
Processes audio in small, 300 ms chunks for low-latency streaming recognition.
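To make the 300 ms chunking concrete, here is a minimal sketch of a streaming loop that slices audio into fixed-size chunks and threads a state object from one chunk to the next. The 8 kHz sample rate is an assumption (typical for telephony, not stated in this summary), and iter_chunks, stream, and the step callback are hypothetical names, not part of the T-one API.

```python
from typing import Any, Callable, Iterator, List, Optional, Sequence, Tuple

SAMPLE_RATE = 8000  # assumption: 8 kHz telephony audio
CHUNK_MS = 300      # chunk length stated in the summary
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_MS // 1000  # samples per chunk


def iter_chunks(
    samples: Sequence[int], chunk_samples: int = CHUNK_SAMPLES
) -> Iterator[List[int]]:
    """Yield fixed-size chunks, zero-padding the final partial chunk."""
    for start in range(0, len(samples), chunk_samples):
        chunk = list(samples[start:start + chunk_samples])
        chunk += [0] * (chunk_samples - len(chunk))
        yield chunk


def stream(
    samples: Sequence[int],
    step: Callable[[List[int], Optional[Any]], Tuple[Any, Any]],
) -> List[Any]:
    """Feed chunks to `step`, carrying its state across chunk boundaries.

    `step` stands in for an acoustic model call: it takes a chunk and the
    previous state (None for the first chunk) and returns (output, new_state).
    """
    state: Optional[Any] = None
    outputs = []
    for chunk in iter_chunks(samples):
        output, state = step(chunk, state)
        outputs.append(output)
    return outputs
```

Under the 8 kHz assumption each chunk holds 2,400 samples, so latency is bounded by the chunk length plus model compute time, while the carried state lets the model see context from earlier chunks.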

Maintenance & Community

The provided README does not detail specific contributors, community channels (e.g., Discord, Slack), sponsorships, or a public roadmap.

Licensing & Compatibility

The project is released under the Apache 2.0 License, which permits commercial use and integration into closed-source applications.

Limitations & Caveats

KenLM has no official Windows support, so Windows users must run the project under WSL or in a container, which complicates setup and can introduce dependency issues.
