ctc-segmentation by lumaku

Python package for audio segmentation and utterance alignment

Created 5 years ago

345 stars

Top 80.3% on SourcePulse

1 Expert Loves This Project

patrickvonplaten

Patrick von Platen

Author of Hugging Face Diffusers; Research Engineer at Mistral

Project Summary

This Python package provides CTC segmentation for aligning audio files with text, enabling utterance-level segmentation and timestamp extraction. It is designed for researchers and developers working with large audio datasets and end-to-end ASR systems.

How It Works

The core of the package is a three-step process:
1. Forward Propagation: Character probabilities from a CTC-based neural network are used to build a trellis diagram. Transition costs are set to zero for the start-of-sentence or blank token, so that audio preceding the first utterance (a preamble) does not penalize the alignment.
2. Backtracking: Starting from the time step with the highest probability for the last character, the most probable path of characters is traced back through all time steps.
3. Confidence Score: A confidence score for each utterance is derived from the character alignment probabilities, which helps detect and filter out low-quality segments.
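The three steps above can be illustrated with a small, dependency-free sketch. It is a simplified stand-in and not the package's own implementation. The function names `ctc_align` and `utterance_confidence`, the `window` parameter, and the toy probabilities are all assumptions made for this example. It works in log space and gives the "no tokens placed yet" state a cost of zero, so a preamble is free.

```python
import math

NEG_INF = float("-inf")


def ctc_align(log_probs, tokens, blank=0):
    """Align `tokens` to frames; return (frame indices, per-token log-probs).

    Hypothetical helper for illustration. `log_probs[t][c]` is the
    log-probability of vocabulary entry `c` at frame `t`.
    """
    num_frames, num_tokens = len(log_probs), len(tokens)
    # trellis[t][j]: best log-probability of having placed the first j tokens
    # within the first t frames.
    trellis = [[NEG_INF] * (num_tokens + 1) for _ in range(num_frames + 1)]
    for t in range(num_frames + 1):
        trellis[t][0] = 0.0  # zero transition cost: the preamble is free

    # Step 1: forward propagation through the trellis.
    for t in range(num_frames):
        for j in range(1, num_tokens + 1):
            stay = trellis[t][j] + log_probs[t][blank]
            move = trellis[t][j - 1] + log_probs[t][tokens[j - 1]]
            trellis[t + 1][j] = max(stay, move)

    # Step 2: backtracking, starting at the time step where the last
    # token has its highest accumulated probability.
    t = max(range(num_frames + 1), key=lambda i: trellis[i][num_tokens])
    if trellis[t][num_tokens] == NEG_INF:
        raise ValueError("no valid alignment (more tokens than frames?)")

    frames = [0] * num_tokens
    token_log_probs = [0.0] * num_tokens
    j = num_tokens
    while j > 0:
        stay = trellis[t - 1][j] + log_probs[t - 1][blank]
        move = trellis[t - 1][j - 1] + log_probs[t - 1][tokens[j - 1]]
        if move >= stay:  # token j was emitted at frame t - 1
            frames[j - 1] = t - 1
            token_log_probs[j - 1] = log_probs[t - 1][tokens[j - 1]]
            j -= 1
        t -= 1
    return frames, token_log_probs


def utterance_confidence(token_log_probs, window=30):
    """Step 3: the worst mean log-probability over any sliding window."""
    n = len(token_log_probs)
    if n == 0:
        raise ValueError("cannot score an empty utterance")
    w = min(window, n)
    means = [sum(token_log_probs[i:i + w]) / w for i in range(n - w + 1)]
    return min(means)


# Toy input: vocabulary is [blank, "a", "b"]; six frames of probabilities.
probs = [
    [0.8, 0.1, 0.1],  # preamble
    [0.1, 0.8, 0.1],  # "a"
    [0.8, 0.1, 0.1],
    [0.1, 0.1, 0.8],  # "b"
    [0.8, 0.1, 0.1],
    [0.8, 0.1, 0.1],
]
log_probs = [[math.log(p) for p in row] for row in probs]
frames, token_lp = ctc_align(log_probs, tokens=[1, 2])
score = utterance_confidence(token_lp)
print(frames, round(score, 4))  # prints: [1, 3] -0.2231
```

Taking the minimum over windows means one badly aligned stretch lowers the score of the whole utterance, which is the property that makes the score usable for filtering. A frame index becomes a timestamp once it is multiplied by the duration of one output frame of the ASR model.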

Quick Start & Requirements

Install via pip: pip install ctc-segmentation
Requires a pre-trained CTC-based ASR model and its corresponding processor/tokenizer (e.g., from Hugging Face Transformers).
Example usage with wav2vec2-large-xlsr-53-english is provided in the README.
See official documentation for detailed integration examples with ESPnet, NeMo, and Speechbrain.

Highlighted Details

Supports both utterance-level alignment and word-level timestamp extraction.
Offers flexibility in data preparation, allowing alignment from token lists or raw text.
Includes parameters for fine-tuning alignment behavior, such as min_window_size and blank_transition_cost_zero.
Provides methods for segment clean-up based on confidence scores.
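As a rough illustration of confidence-based clean-up, the sketch below splits aligned segments into kept and dropped sets. The `Segment` type, the `filter_segments` function, and the threshold of -1.5 are assumptions made for this example and are not part of the package's API. Scores are treated as log-space values, so values closer to zero indicate a better alignment.

```python
from typing import List, NamedTuple, Tuple


class Segment(NamedTuple):
    """Hypothetical container for one aligned utterance."""
    text: str
    start: float  # seconds
    end: float    # seconds
    score: float  # log-space confidence; closer to 0 is better


def filter_segments(
    segments: List[Segment], min_score: float = -1.5
) -> Tuple[List[Segment], List[Segment]]:
    """Split segments into (kept, dropped) by confidence score."""
    kept, dropped = [], []
    for seg in segments:
        (kept if seg.score >= min_score else dropped).append(seg)
    return kept, dropped


segments = [
    Segment("good morning", 0.52, 1.31, -0.21),
    Segment("unintelligible part", 1.40, 3.05, -4.70),
    Segment("thank you", 3.10, 3.88, -0.90),
]
kept, dropped = filter_segments(segments)
print([s.text for s in kept])     # prints: ['good morning', 'thank you']
print([s.text for s in dropped])  # prints: ['unintelligible part']
```

A suitable threshold depends on the ASR model and the data, so it is worth checking a sample of the dropped segments by ear before settling on a value.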

Maintenance & Community

The project is maintained by Ludwig W.; the Health Check below lists the last commit as 1 year ago and responsiveness as inactive.
Integration examples are available for popular ASR toolkits like ESPnet, NeMo, and Speechbrain.

Licensing & Compatibility

The package is released under the MIT License.
Compatible with commercial use and closed-source linking.

Limitations & Caveats

Requires a CTC-based ASR model; it is not a standalone ASR system.
Transformer-based ASR models can lead to high memory consumption and slow inference on long audio files, potentially requiring audio partitioning.
The accuracy of alignments is dependent on the performance and potential frame shifts of the underlying ASR model.
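The audio partitioning mentioned above could be sketched as follows. This only computes overlapping sample ranges. Running the ASR model on each chunk and stitching the resulting probabilities back together is left out, and the function name `partition_ranges` and its parameters are assumptions made for this example.

```python
from typing import List, Tuple


def partition_ranges(
    num_samples: int, chunk_size: int, overlap: int
) -> List[Tuple[int, int]]:
    """Return (start, end) sample ranges covering the audio with overlap."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    ranges = []
    start = 0
    while start < num_samples:
        end = min(start + chunk_size, num_samples)
        ranges.append((start, end))
        if end == num_samples:  # the last chunk reached the end of the audio
            break
        start += step
    return ranges


# Ten samples, chunks of four samples, one sample of overlap.
print(partition_ranges(10, 4, 1))  # prints: [(0, 4), (3, 7), (6, 10)]
```

The overlap gives the model some context on both sides of each cut. When the per-chunk outputs are merged, the frames from overlapping regions have to be dropped or reconciled so that each part of the audio is counted once.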

Health Check

Last Commit

1 year ago

Responsiveness

Inactive

Pull Requests (30d)

0

Issues (30d)

1

Star History

2 stars in the last 30 days

Explore Similar Projects

speech-recognition-uk by egorsmkv

Resource collection for Ukrainian speech AI

Created 5 years ago

Updated 4 months ago

reverb by revdotcom

Open-source inference code for speech recognition and diarization models

Created 1 year ago

Updated 8 months ago

Starred by

Travis Fischer (Founder of Agentic).

echogarden by echogarden-project

Cross-platform speech toolset for command-line or Node.js use

Created 2 years ago

Updated 4 months ago

edgedict by theblackcat102

RNN-Transducer for online speech recognition

Created 5 years ago

Updated 4 years ago

Fun-ASR by FunAudioLLM

Advanced speech recognition toolkit for global audio

Created 3 weeks ago

Updated 3 days ago

VITA-Audio by VITA-MLLM

Speech model for fast audio-text token generation

Created 8 months ago

Updated 7 months ago

Starred by

Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral).

GigaSpeech by SpeechColab

Large dataset for speech recognition research

Created 4 years ago

Updated 1 year ago

Starred by

Alexander Borzunov (Research Scientist at OpenAI).

speech_course by yandexdataschool

Speech processing course materials

Created 4 years ago

Updated 5 months ago

Starred by

Omar Sanseviero (DevRel at Google DeepMind).

whisper-plus by kadirnar

Speech-to-text toolkit for enhanced audio processing

Created 2 years ago

Updated 1 month ago

Montreal-Forced-Aligner by MontrealCorpusTools

Forced alignment for speech datasets

Created 10 years ago

Updated 1 month ago

Starred by

Tim J. Baek (Founder of Open WebUI), Gabriel Almeida (Cofounder of Langflow), and 2 more.

whisper-diarization by MahmoudAshraf97

ASR pipeline for speaker diarization

Created 3 years ago

Updated 1 month ago

Starred by

Georgios Konstantopoulos (CTO, General Partner at Paradigm), Paul Gauthier (Founder of Aider), and 9 more.

whisperX by m-bain

ASR tool for accurate, batched, word-level Whisper transcriptions

Created 3 years ago

Updated 2 months ago
