Generate speech datasets from audio or URLs
Ranked in the 99.6th percentile on SourcePulse
This project provides a comprehensive toolkit for generating labeled audio datasets for text-to-speech and speech-to-text models. It automates audio transcription, enhancement, speaker identification, and segmentation; it accepts diverse input sources such as YouTube, LibriVox, and TED Talks, and can write output formats such as LJSpeech.
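For reference, an LJSpeech-style dataset is a wavs/ directory of clips plus a pipe-delimited metadata.csv with one "id|raw transcript|normalized transcript" row per clip. A minimal sketch of that layout, with hypothetical file names:

import os

# LJSpeech layout: clips live in dataset/wavs/<id>.wav, one metadata row per clip.
os.makedirs("dataset/wavs", exist_ok=True)
with open("dataset/metadata.csv", "a", encoding="utf-8") as f:
    f.write("seg_0001|Hello world.|Hello world.\n")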
How It Works
The generator leverages WhisperX for multilingual transcription and segmentation, and pyannote for speaker diarization, storing speaker embeddings in ChromaDB; optional audio enhancement is available via DeepFilterNet, ResembleAI, or Mayavoz. It processes audio files or streams, extracts segments within specified time ranges, transcribes them, identifies speakers, and compiles metadata that includes speech-rate metrics.
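As an illustration, the following sketch strings the same stages together using the libraries' public APIs (whisperx, pyannote.audio, chromadb). It is not the project's own code; the model names, file paths, and collection name are assumptions:

import os
import whisperx
import chromadb
from pyannote.audio import Model, Pipeline, Inference

device = "cuda"  # or "cpu"
audio_file = "input.wav"  # hypothetical input path
hf_token = os.environ["HF_TOKEN"]  # pyannote models are gated on Hugging Face

# 1) Multilingual transcription with word-level alignment (WhisperX)
asr = whisperx.load_model("large-v2", device)
audio = whisperx.load_audio(audio_file)
result = asr.transcribe(audio, batch_size=16)
align_model, align_meta = whisperx.load_align_model(
    language_code=result["language"], device=device)
result = whisperx.align(result["segments"], align_model, align_meta, audio, device)

# 2) Speaker diarization (pyannote): who speaks when
diarization = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1", use_auth_token=hf_token)(audio_file)
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:.2f}-{turn.end:.2f}s  {speaker}")

# 3) Speaker embedding stored in ChromaDB for re-identification across files
embedder = Inference(
    Model.from_pretrained("pyannote/embedding", use_auth_token=hf_token),
    window="whole")
embedding = embedder(audio_file)  # one vector for the whole file
db = chromadb.PersistentClient(path="speaker_db")
db.get_or_create_collection("speakers").add(
    ids=[audio_file], embeddings=[embedding.tolist()])

# 4) Per-segment metadata, e.g. a naive speech-rate metric (words per second)
for seg in result["segments"]:
    duration = seg["end"] - seg["start"]
    rate = len(seg["text"].split()) / duration if duration > 0 else 0.0
    print(f"{seg['start']:.2f}-{seg['end']:.2f}s  {rate:.2f} w/s  {seg['text']}")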
Quick Start & Requirements
pip install -r requirements.txt
# or, as an editable install:
pip install -e .
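To verify one of the optional enhancement backends independently, DeepFilterNet's published Python API can be exercised on a single file (paths here are hypothetical; install with pip install deepfilternet):

from df.enhance import enhance, init_df, load_audio, save_audio

model, df_state, _ = init_df()  # load the default DeepFilterNet model
audio, _ = load_audio("noisy_segment.wav", sr=df_state.sr())
save_audio("enhanced_segment.wav", enhance(model, df_state, audio), df_state.sr())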
Maintenance & Community
The project is developed by davidmartinrius; the repository's last update was about a year ago and it is currently marked inactive. Links to community support (Discord/Slack) or a roadmap are not provided in the README.
Licensing & Compatibility
This project is licensed under the MIT License. Dependencies carry varying licenses: BSD-4-Clause (WhisperX), MIT or Apache 2.0 (most others), and the Unlicense (yt-dlp). Commercial use is generally permitted under MIT, but users should verify compatibility with all dependency licenses.
Limitations & Caveats
The project is primarily tested on Ubuntu 22; compatibility with other operating systems is not guaranteed. Audio segments may be discarded if they fall outside the accepted time ranges or remain low quality even after enhancement. LibriSpeech-style dataset generation is in beta, and the published PyPI package is currently non-functional due to dependency issues.
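The discarding behavior is easiest to picture as a duration/quality gate; a hypothetical filter along these lines (the project's actual thresholds and quality checks may differ):

MIN_SECONDS, MAX_SECONDS = 1.0, 15.0  # hypothetical bounds

def keep_segment(seg):
    # Drop clips outside the accepted time range; a real pipeline would
    # also apply a quality check after enhancement.
    duration = seg["end"] - seg["start"]
    return MIN_SECONDS <= duration <= MAX_SECONDS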