ctc-segmentation  by lumaku

Python package for audio segmentation and utterance alignment

created 5 years ago
338 stars

Top 82.6% on sourcepulse

Project Summary

This Python package provides CTC segmentation for aligning audio files with text, enabling utterance-level segmentation and timestamp extraction. It is designed for researchers and developers working with large audio datasets and end-to-end ASR systems.

How It Works

The core of the package is a three-step process:

  1. Forward Propagation: Character probabilities from a CTC-based neural network are used to build a trellis diagram. Transitions from the start-of-sentence or blank token have zero cost, so the alignment can skip a preamble before the first utterance.
  2. Backtracking: The most probable path of characters through all time steps is determined, starting from the time step with the highest probability for the last character.
  3. Confidence Score: A confidence score for each utterance is derived from the character alignment probabilities, which helps detect and filter low-quality segments.
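
The forward and backtracking steps described above can be illustrated with a minimal pure-Python sketch. This is not the package's implementation: the function name `align_tokens`, the toy probabilities, and the simplification that a token can only be held through blank frames are assumptions made for illustration.

```python
import math

NEG_INF = float("-inf")


def align_tokens(log_probs, tokens, blank=0):
    """Return, for each token, the frame index where it is emitted on the best path.

    log_probs: list of frames, each a list of per-symbol log-probabilities.
    tokens:    list of symbol indices to align (hypothetical toy vocabulary).
    """
    n_frames, n_tokens = len(log_probs), len(tokens)
    # table[t][j]: best log-probability of having emitted the first j tokens
    # within the first t frames. Column 0 means "nothing emitted yet".
    table = [[NEG_INF] * (n_tokens + 1) for _ in range(n_frames + 1)]
    for t in range(n_frames + 1):
        table[t][0] = 0.0  # zero transition cost: the preamble is skipped for free

    # Forward propagation: fill the trellis frame by frame.
    for t in range(1, n_frames + 1):
        frame = log_probs[t - 1]
        for j in range(1, n_tokens + 1):
            stay = table[t - 1][j] + frame[blank]                 # hold token j
            switch = table[t - 1][j - 1] + frame[tokens[j - 1]]   # emit token j
            table[t][j] = max(stay, switch)

    # Backtracking: start where the last token is most probable.
    t = max(range(n_frames + 1), key=lambda i: table[i][n_tokens])
    if table[t][n_tokens] == NEG_INF:
        raise ValueError("not enough frames to align all tokens")
    timings = [0] * n_tokens
    j = n_tokens
    while j > 0:
        frame = log_probs[t - 1]
        stay = table[t - 1][j] + frame[blank]
        switch = table[t - 1][j - 1] + frame[tokens[j - 1]]
        if switch >= stay:
            timings[j - 1] = t - 1  # token j was emitted at this frame
            j -= 1
        t -= 1
    return timings


# Toy example: symbols are 0 = blank, 1 = "a", 2 = "b"; two leading frames are preamble.
probs = [
    [0.90, 0.05, 0.05],
    [0.90, 0.05, 0.05],
    [0.10, 0.80, 0.10],
    [0.80, 0.10, 0.10],
    [0.10, 0.10, 0.80],
    [0.90, 0.05, 0.05],
]
log_probs = [[math.log(p) for p in frame] for frame in probs]
print(align_tokens(log_probs, tokens=[1, 2]))  # [2, 4]
```

Multiplying each frame index by the model's frame duration would turn these indices into timestamps in seconds.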

Quick Start & Requirements

  • Install via pip: pip install ctc-segmentation
  • Requires a pre-trained CTC-based ASR model and its corresponding processor/tokenizer (e.g., from Hugging Face Transformers).
  • Example usage with wav2vec2-large-xlsr-53-english is provided in the README.
  • See the official documentation for detailed integration examples with ESPnet, NeMo, and SpeechBrain.

Highlighted Details

  • Supports both utterance-level alignment and word-level timestamp extraction.
  • Offers flexibility in data preparation, allowing alignment from token lists or raw text.
  • Includes parameters for fine-tuning alignment behavior, such as min_window_size and blank_transition_cost_zero.
  • Provides methods for segment clean-up based on confidence scores.
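
As a rough illustration of confidence-based clean-up, the sketch below scores each segment by its worst window of frames and drops segments below a threshold. The function names, the window size, and the threshold are illustrative assumptions, not the package's API or recommended values.

```python
def min_mean_score(frame_log_probs, window=30):
    """Lowest mean log-probability over consecutive windows of `window` frames.

    One badly aligned stretch is enough to pull the score down, even if
    the rest of the segment aligns well.
    """
    if not frame_log_probs:
        raise ValueError("segment has no frames")
    means = []
    for start in range(0, len(frame_log_probs), window):
        chunk = frame_log_probs[start:start + window]
        means.append(sum(chunk) / len(chunk))
    return min(means)


def filter_segments(segments, threshold):
    """Keep only segments whose score is at or above the threshold."""
    return [seg for seg in segments if seg["score"] >= threshold]


# Hypothetical segments with per-frame alignment log-probabilities.
segments = [
    {"text": "good segment", "frames": [-0.1, -0.1, -0.1, -0.1]},
    {"text": "bad segment", "frames": [-0.1, -0.1, -3.0, -5.0]},
]
for seg in segments:
    seg["score"] = min_mean_score(seg["frames"], window=2)

kept = filter_segments(segments, threshold=-1.0)
print([seg["text"] for seg in kept])  # ['good segment']
```

Taking the minimum over windows rather than the overall mean is a deliberate choice in this sketch: a long segment with a short misaligned stretch would otherwise still look acceptable.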

Maintenance & Community

  • The project is maintained by Ludwig W.
  • Integration examples are available for popular ASR toolkits such as ESPnet, NeMo, and SpeechBrain.

Licensing & Compatibility

  • The package is released under the MIT License.
  • Compatible with commercial use and closed-source linking.

Limitations & Caveats

  • Requires a CTC-based ASR model; it is not a standalone ASR system.
  • Transformer-based ASR models can lead to high memory consumption and slow inference on long audio files, potentially requiring audio partitioning.
  • The accuracy of alignments is dependent on the performance and potential frame shifts of the underlying ASR model.
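
One way to picture the audio partitioning mentioned above is to cut a long recording into overlapping windows, run the model on each window, and merge the results. The helper below only computes the window boundaries; its name and parameters are hypothetical, and how the overlapping frames are merged afterwards depends on the toolkit.

```python
def partition_bounds(n_samples, max_len, overlap):
    """Split [0, n_samples) into windows of at most max_len samples.

    Consecutive windows share `overlap` samples so that the model sees
    some context on both sides of every cut.
    """
    if max_len <= overlap:
        raise ValueError("max_len must be larger than overlap")
    if n_samples <= 0:
        return []
    step = max_len - overlap
    bounds = []
    start = 0
    while True:
        end = min(start + max_len, n_samples)
        bounds.append((start, end))
        if end == n_samples:
            break
        start += step
    return bounds


# Toy example: 10 samples, windows of 4 samples, 1 sample of overlap.
print(partition_bounds(10, max_len=4, overlap=1))  # [(0, 4), (3, 7), (6, 10)]
```

For real audio the arguments would be sample counts, for example 30 seconds at 16 kHz gives `max_len=480000`.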

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 3 stars in the last 90 days
