dataspeech by huggingface

Suite of scripts for tagging speech datasets, especially for TTS model development

created 1 year ago
373 stars

Top 77.1% on sourcepulse

Project Summary

Data-Speech is a Python utility suite for annotating speech datasets with natural language descriptions of speaker characteristics, primarily for training text-to-speech (TTS) models like Parler-TTS. It targets researchers and developers in the speech AI domain, enabling reproducible and detailed dataset conditioning.

How It Works

The pipeline has three stages: 1) main.py predicts continuous audio features (e.g., speaking rate, SNR, reverberation); 2) metadata_to_text.py maps those continuous values to discrete text bins (e.g., "slightly slowly," "very monotone"); and 3) run_prompt_creation.py or run_prompt_creation_llm_swarm.py uses an LLM to turn the bins into natural language descriptions. Because each utterance ends up with its own description, the resulting annotations give fine-grained control over TTS model conditioning.
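The second stage, mapping a continuous value to a text bin, can be sketched in a few lines. This is a minimal illustration, not the logic of metadata_to_text.py: the function name `to_text_bin`, the bin edges, and most of the labels are assumptions chosen for the example (only "slightly slowly" appears in the summary above).

```python
from bisect import bisect_right

# Hypothetical labels and bin edges, for illustration only.
# They are NOT the values used by the repository.
SPEAKING_RATE_LABELS = [
    "very slowly",
    "slightly slowly",
    "moderate speed",
    "slightly fast",
    "very fast",
]
SPEAKING_RATE_EDGES = [8.0, 12.0, 16.0, 20.0]  # assumed unit: phonemes per second


def to_text_bin(value, edges, labels):
    """Return the label of the bin that `value` falls into.

    `edges` must be sorted ascending and hold one fewer entry than `labels`.
    A value equal to an edge is placed in the bin above that edge.
    """
    if len(labels) != len(edges) + 1:
        raise ValueError("need exactly len(edges) + 1 labels")
    return labels[bisect_right(edges, value)]


if __name__ == "__main__":
    for rate in (5.0, 10.0, 14.0, 25.0):
        print(rate, "->", to_text_bin(rate, SPEAKING_RATE_EDGES, SPEAKING_RATE_LABELS))
```

In practice the edges would be derived from the dataset's own feature distribution rather than hard-coded, so that each bin is meaningful for the speakers at hand.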

Quick Start & Requirements

  • Install via pip install -r requirements.txt after cloning the repository.
  • Requires Python and the datasets library. GPU acceleration is highly recommended for prompt creation and some feature extraction.
  • Official documentation and examples are available within the repository.

Highlighted Details

  • Reproduces annotation methods from Lyth and King's research paper.
  • Supports datasets from the Hugging Face Hub and local files.
  • Can generate annotations for speaker characteristics like speaking rate, pitch, SNR, and reverberation.
  • Leverages LLMs (e.g., Mistral, Llama 3) for natural language description generation.
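To make the LLM step concrete, the sketch below assembles a text prompt from a set of binned attributes. It is only an illustration under stated assumptions: the function `build_description_prompt`, the instruction wording, and the attribute names are hypothetical and are not taken from run_prompt_creation.py.

```python
def build_description_prompt(tags):
    """Build an instruction prompt from binned speaker attributes.

    `tags` maps an attribute name to its text bin, e.g.
    {"speaking rate": "slightly slowly"}. The wording of the instruction
    is a hypothetical template, not the repository's prompt.
    """
    if not tags:
        raise ValueError("at least one attribute is required")
    attribute_lines = [f"- {name}: {value}" for name, value in tags.items()]
    return (
        "Describe the speaker's delivery in one sentence, "
        "using only the attributes listed below.\n"
        + "\n".join(attribute_lines)
    )


if __name__ == "__main__":
    example_tags = {
        "speaking rate": "slightly slowly",
        "pitch variation": "very monotone",
    }
    # The resulting string would be sent to an instruction-tuned LLM.
    print(build_description_prompt(example_tags))
```

Keeping prompt construction separate from model inference makes it easy to inspect what the model is asked before spending GPU time on generation.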

Maintenance & Community

  • Most recent notable update: August 2024, for Parler-TTS v1.
  • Built upon libraries like datasets, transformers, and accelerate.
  • Aims to support the TTS research community.

Licensing & Compatibility

  • The README does not state a license explicitly. A permissive license (e.g., MIT or Apache) is likely given the Hugging Face affiliation, but this is unconfirmed and should be verified in the repository before reuse.
  • Compatible with Hugging Face datasets and transformers ecosystem.

Limitations & Caveats

  • Prompt creation for multi-speaker datasets requires specific adaptation of script flags and potentially mapping files.
  • Some TODO items indicate ongoing development, such as accent classification and multilingual support.
  • TGI-based inference requires a SLURM cluster setup.

Health Check

  • Last commit: 11 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 15 stars in the last 90 days
