speech-dataset-generator by davidmartinrius

Generate speech datasets from audio or URLs

Created 1 year ago
252 stars

Top 99.6% on SourcePulse

View on GitHub
Project Summary

This project provides a comprehensive toolkit for generating labeled audio datasets for text-to-speech and speech-to-text models. It automates audio transcription, enhancement, speaker identification, and segmentation; supports diverse input sources such as YouTube, LibriVox, and TED Talks; and outputs datasets in formats such as LJSpeech.

How It Works

The generator uses WhisperX for multilingual transcription and segmentation, pyannote for speaker diarization (with embeddings stored in ChromaDB), and optional audio enhancement via DeepFilterNet, ResembleAI, or Mayavoz. It processes audio files or streams, extracts segments within specified time ranges, transcribes them, identifies speakers, and compiles metadata, including speech-rate metrics.
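The speech-rate metric mentioned above can be sketched as a simple words-per-minute calculation over one transcribed segment. The field names and the words-per-minute definition here are illustrative assumptions, not the project's actual code:

```python
def speech_rate_wpm(transcription: str, start_s: float, end_s: float) -> float:
    """Words per minute for one transcribed segment."""
    duration_min = (end_s - start_s) / 60.0
    if duration_min <= 0:
        raise ValueError("segment must have a positive duration")
    return len(transcription.split()) / duration_min

# Hypothetical segment, shaped like WhisperX-style output might be.
segment = {"text": "hello world this is a test", "start": 0.0, "end": 3.0}
rate = speech_rate_wpm(segment["text"], segment["start"], segment["end"])  # 120.0 wpm
```

A metric like this lets a pipeline flag unusually fast or slow clips before they enter the dataset.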

Quick Start & Requirements

  • Install: pip install -r requirements.txt or pip install -e .
  • Prerequisites: Python 3.10, a Hugging Face token (for the pyannote models), and acceptance of the pyannote model terms.
  • OS: Tested on Ubuntu 22; macOS and Windows are untested.
  • Resources: Requires GPU for optimal performance with enhancement models and pyannote.
  • Docs: Project README

Highlighted Details

  • Supports multiple input sources: local files, folders, YouTube, LibriVox, and TED Talks.
  • Includes audio enhancement options (DeepFilterNet, ResembleAI, Mayavoz) for quality improvement.
  • Performs speaker diarization and stores embeddings in ChromaDB for automatic speaker naming.
  • Outputs datasets in LJSpeech format, with ongoing development for Metavoice-src and LibriSpeech.
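The LJSpeech output format noted above is a pipe-delimited metadata.csv with one line per clip: a clip id, the raw transcription, and a normalized transcription, with audio stored alongside it. A minimal sketch of emitting such lines (the clip id and helper name are hypothetical):

```python
def ljspeech_rows(segments):
    """Format (clip_id, transcription, normalized) triples as LJSpeech metadata lines."""
    return ["|".join((clip_id, text, norm)) for clip_id, text, norm in segments]

# Hypothetical clip; real ids would correspond to wav files in the dataset folder.
lines = ljspeech_rows([("clip_0001", "Hello there.", "hello there")])
# lines[0] == "clip_0001|Hello there.|hello there"
```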

Maintenance & Community

The project is developed by a single maintainer, davidmartinrius. The README does not link to community support channels (Discord/Slack) or a roadmap.

Licensing & Compatibility

This project is licensed under the MIT License. Dependencies carry varying licenses: BSD-4-Clause (WhisperX), MIT or Apache 2.0 (most others), and the Unlicense (yt-dlp). Commercial use is generally permitted under MIT, but users should verify compatibility with all dependency licenses.

Limitations & Caveats

The project has been tested primarily on Ubuntu 22; compatibility with other operating systems is not guaranteed. Some audio segments may be discarded if they fall outside the specified time ranges or remain low quality even after enhancement. LibriSpeech dataset generation is in beta. A PyPI package exists but is non-functional due to dependency issues.
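The time-range filtering behind that caveat can be sketched as a simple duration check. The threshold values and function name here are illustrative assumptions, not the tool's actual defaults:

```python
def keep_segment(start_s: float, end_s: float,
                 min_s: float = 4.0, max_s: float = 10.0) -> bool:
    """Keep only segments whose duration lies inside the accepted range."""
    return min_s <= (end_s - start_s) <= max_s

# Three candidate segments: only the 5-second one falls inside [4, 10] seconds.
segments = [(0.0, 5.0), (5.0, 6.0), (6.0, 30.0)]
kept = [s for s in segments if keep_segment(*s)]
```

Segments rejected here never reach transcription, which is why short or overlong clips silently drop out of the final dataset.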

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 2 stars in the last 30 days

Explore Similar Projects

Starred by Stas Bekman (author of "Machine Learning Engineering Open Book"; research engineer at Snowflake).

awesome-diarization by wq2012

List of resources for speaker diarization

  • 0.2% on SourcePulse
  • 2k stars
  • Created 6 years ago; updated 1 month ago