speech-dataset-generator by davidmartinrius

Generate speech datasets from audio or URLs

Created 1 year ago
252 stars

Top 99.6% on SourcePulse

View on GitHub
Project Summary

This project provides a comprehensive toolkit for generating labeled audio datasets for text-to-speech and speech-to-text models. It automates audio transcription, enhancement, speaker identification, and segmentation; supports diverse input sources such as YouTube, LibriVox, and TED Talks; and outputs datasets in formats such as LJSpeech.

How It Works

The generator uses WhisperX for multilingual transcription and segmentation, pyannote for speaker diarization (with embeddings stored in ChromaDB), and optional audio enhancement via DeepFilterNet, ResembleAI, or Mayavoz. It processes audio files or streams, extracts segments within specified time ranges, transcribes them, identifies speakers, and compiles metadata, including speech-rate metrics.
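The speech-rate metric mentioned above can be sketched as a simple words-per-minute calculation over one transcribed segment. The field names and the words-per-minute definition here are illustrative assumptions, not the project's actual code:

```python
def speech_rate_wpm(transcription: str, start_s: float, end_s: float) -> float:
    """Words per minute for one transcribed segment."""
    duration_min = (end_s - start_s) / 60.0
    if duration_min <= 0:
        raise ValueError("segment must have a positive duration")
    return len(transcription.split()) / duration_min

# Hypothetical segment, shaped like WhisperX-style output might be.
segment = {"text": "hello world this is a test", "start": 0.0, "end": 3.0}
rate = speech_rate_wpm(segment["text"], segment["start"], segment["end"])  # 120.0 wpm
```

A metric like this lets a pipeline flag unusually fast or slow clips before they enter the dataset.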

Quick Start & Requirements

  • Install: pip install -r requirements.txt or pip install -e .
  • Prerequisites: Python 3.10, a Hugging Face token (for the pyannote models), and acceptance of the pyannote model terms.
  • OS: Tested on Ubuntu 22; macOS and Windows are untested.
  • Resources: Requires GPU for optimal performance with enhancement models and pyannote.
  • Docs: Project README

Highlighted Details

  • Supports multiple input sources: local files, folders, YouTube, LibriVox, and TED Talks.
  • Includes audio enhancement options (DeepFilterNet, ResembleAI, Mayavoz) for quality improvement.
  • Performs speaker diarization and stores embeddings in ChromaDB for automatic speaker naming.
  • Outputs datasets in LJSpeech format, with ongoing development for Metavoice-src and LibriSpeech.
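The LJSpeech output format noted above is a pipe-delimited metadata.csv with one line per clip: a clip id, the raw transcription, and a normalized transcription, with audio stored alongside it. A minimal sketch of emitting such lines (the clip id and helper name are hypothetical):

```python
def ljspeech_rows(segments):
    """Format (clip_id, transcription, normalized) triples as LJSpeech metadata lines."""
    return ["|".join((clip_id, text, norm)) for clip_id, text, norm in segments]

# Hypothetical clip; real ids would correspond to wav files in the dataset folder.
lines = ljspeech_rows([("clip_0001", "Hello there.", "hello there")])
# lines[0] == "clip_0001|Hello there.|hello there"
```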

Maintenance & Community

The project is developed by a single maintainer, davidmartinrius. The README does not link to community support channels (Discord/Slack) or a roadmap.

Licensing & Compatibility

This project is licensed under the MIT License. Dependencies carry varying licenses: BSD-4-Clause (WhisperX), MIT or Apache 2.0 (most others), and the Unlicense (yt-dlp). Commercial use is generally permitted under MIT, but users should verify compatibility with all dependency licenses.

Limitations & Caveats

The project has been tested primarily on Ubuntu 22; compatibility with other operating systems is not guaranteed. Some audio segments may be discarded if they fall outside the specified time ranges or remain low quality even after enhancement. LibriSpeech dataset generation is in beta. A PyPI package exists but is non-functional due to dependency issues.
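The time-range filtering behind that caveat can be sketched as a simple duration check. The threshold values and function name here are illustrative assumptions, not the tool's actual defaults:

```python
def keep_segment(start_s: float, end_s: float,
                 min_s: float = 4.0, max_s: float = 10.0) -> bool:
    """Keep only segments whose duration lies inside the accepted range."""
    return min_s <= (end_s - start_s) <= max_s

# Three candidate segments: only the 5-second one falls inside [4, 10] seconds.
segments = [(0.0, 5.0), (5.0, 6.0), (6.0, 30.0)]
kept = [s for s in segments if keep_segment(*s)]
```

Segments rejected here never reach transcription, which is why short or overlong clips silently drop out of the final dataset.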

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 2 stars in the last 30 days

Explore Similar Projects

Starred by Stas Bekman (author of "Machine Learning Engineering Open Book"; research engineer at Snowflake).

awesome-diarization by wq2012

List of resources for speaker diarization

  • 0.2% on SourcePulse
  • 2k stars
  • Created 6 years ago; updated 1 month ago