dataspeech by huggingface

Suite of scripts for tagging speech datasets, especially for TTS model development

Created 1 year ago

385 stars

Top 74.3% on SourcePulse

1 Expert Loves This Project

mlabonne

Head of Post-Training at Liquid AI

Project Summary

Data-Speech is a Python utility suite for annotating speech datasets with natural language descriptions of speaker characteristics, primarily for training text-to-speech (TTS) models like Parler-TTS. It targets researchers and developers in the speech AI domain, enabling reproducible and detailed dataset conditioning.

How It Works

The core process involves three stages: 1) predicting continuous audio features (e.g., speaking rate, SNR, reverberation) using main.py, 2) mapping these continuous features to discrete text bins (e.g., "slightly slowly," "very monotone") via metadata_to_text.py, and 3) generating natural language descriptions from these bins using LLMs via run_prompt_creation.py or run_prompt_creation_llm_swarm.py. This approach allows for fine-grained control over TTS model conditioning.
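The second stage, turning a continuous value into a text bin, can be sketched as follows. This is a minimal illustration, not the code in metadata_to_text.py: the function name `to_text_bin`, the bin edges, and the full label set are assumptions made for the example. Only "slightly slowly" is taken from the summary above.

```python
from bisect import bisect_right
from typing import Sequence

# Hypothetical labels and bin edges, chosen only for illustration.
# The real script derives its own bins; these numbers are assumptions.
SPEAKING_RATE_LABELS = [
    "very slowly",
    "slightly slowly",
    "moderate speed",
    "slightly fast",
    "very fast",
]
SPEAKING_RATE_EDGES = [8.0, 12.0, 16.0, 20.0]  # assumed units and thresholds


def to_text_bin(value: float, edges: Sequence[float], labels: Sequence[str]) -> str:
    """Map a continuous feature value to the label of the bin it falls in.

    `edges` must be sorted ascending and there must be exactly one more
    label than there are edges. A value equal to an edge goes to the
    upper bin.
    """
    if len(labels) != len(edges) + 1:
        raise ValueError("need exactly len(edges) + 1 labels")
    return labels[bisect_right(edges, value)]


if __name__ == "__main__":
    for rate in (3.0, 10.5, 25.0):
        print(rate, "->", to_text_bin(rate, SPEAKING_RATE_EDGES, SPEAKING_RATE_LABELS))
```

The same helper would work for any other continuous feature (SNR, reverberation) given its own edges and labels.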

Quick Start & Requirements

Install via pip install -r requirements.txt after cloning the repository.
Requires Python and the datasets library. GPU acceleration is highly recommended for prompt creation and some feature extraction.
Official documentation and examples are available within the repository.

Highlighted Details

Reproduces annotation methods from Lyth and King's research paper.
Supports datasets from the Hugging Face Hub and local files.
Can generate annotations for speaker characteristics like speaking rate, pitch, SNR, and reverberation.
Leverages LLMs (e.g., Mistral, Llama 3) for natural language description generation.
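To make the LLM step concrete, here is a rough sketch of how discrete tags might be assembled into a prompt before being sent to a model such as Mistral or Llama 3. The function `build_description_prompt`, the prompt wording, and the tag names are all hypothetical. The prompts actually used by run_prompt_creation.py may differ, and the model call itself is left out.

```python
from typing import Mapping


def build_description_prompt(tags: Mapping[str, str]) -> str:
    """Assemble a plain-text prompt from attribute -> text-bin pairs.

    The wording below is an illustrative assumption, not the template
    shipped with the repository.
    """
    attribute_lines = [f"- {name}: {value}" for name, value in tags.items()]
    return (
        "Write one sentence describing how the speaker sounds, "
        "using only the attributes below.\n" + "\n".join(attribute_lines)
    )


# Example tags; "slightly slowly" and "very monotone" come from the summary,
# the attribute names are assumed.
example_tags = {
    "speaking rate": "slightly slowly",
    "pitch variation": "very monotone",
}
prompt = build_description_prompt(example_tags)

if __name__ == "__main__":
    print(prompt)
```

The resulting string would then be passed to whichever LLM backend is in use to produce the final natural language description.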

Maintenance & Community

Updated in August 2024 for Parler-TTS v1; the repository has seen little activity since.
Built upon libraries like datasets, transformers, and accelerate.
Aims to support the TTS research community.

Licensing & Compatibility

The README does not explicitly state a license. A permissive license (e.g., MIT, Apache) seems likely given the Hugging Face affiliation, but this should be verified against the repository before reuse.
Compatible with Hugging Face datasets and transformers ecosystem.

Limitations & Caveats

Prompt creation for multi-speaker datasets requires specific adaptation of script flags and potentially mapping files.
Some TODO items indicate ongoing development, such as accent classification and multilingual support.
TGI-based inference requires a SLURM cluster setup.

Health Check

Last Commit: 1 year ago

Responsiveness: Inactive

Pull Requests (30d): 0

Issues (30d): 0

Star History

5 stars in the last 30 days

Explore Similar Projects

speech-recognition-uk by egorsmkv

Resource collection for Ukrainian speech AI

Created 5 years ago

Updated 4 months ago

Starred by

Luis Capelo (Cofounder of Lightning AI).

VoiceStar by jasonppy

Robust, duration-controllable TTS that extrapolates

Created 9 months ago

Updated 7 months ago

deepspeech-german by AASHISHAG

ASR module using Mozilla DeepSpeech for German speech

Created 6 years ago

Updated 2 years ago

Starred by

Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral).

FastDiff by Rongjiehuang

PyTorch implementation for fast, high-fidelity speech synthesis via conditional diffusion

Created 4 years ago

Updated 1 year ago

Starred by

Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral).

GigaSpeech by SpeechColab

Large dataset for speech recognition research

Created 4 years ago

Updated 1 year ago

zamia-speech by gooofy

Speech tools/data for cloudless ASR, plus TTS voice training

Created 9 years ago

Updated 4 years ago

SLAM-LLM by X-LANCE

MLLM toolkit for speech, language, audio, and music processing

Created 2 years ago

Updated 2 months ago

athena by athena-team

Open-source speech processing engine for industrial/academic use

Created 6 years ago

Updated 3 years ago

vall-e by lifeiteng

PyTorch for zero-shot text-to-speech synthesis, re-implementing VALL-E

Created 3 years ago

Updated 4 months ago

Starred by

Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral).

icefall by k2-fsa

Speech-related recipes for various datasets using k2-fsa and lhotse

Created 4 years ago

Updated 1 month ago

FunASR by modelscope

Speech recognition toolkit for bridging research and industrial applications

Created 3 years ago

Updated 4 days ago

Starred by

Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Piotr Dąbkowski (Cofounder of ElevenLabs), and 2 more.

PaddleSpeech by PaddlePaddle

Speech toolkit for ASR, TTS, speaker verification, translation, and keyword spotting

Created 8 years ago

Updated 2 months ago
