GigaSpeech by SpeechColab

Large dataset for speech recognition research

created 4 years ago
689 stars

Top 50.3% on sourcepulse

View on GitHub
Project Summary

GigaSpeech is a large-scale, evolving dataset designed for Automatic Speech Recognition (ASR) research, offering over 10,000 hours of high-quality transcribed English audio. It caters to researchers and practitioners needing diverse, real-world speech data for training and evaluating ASR systems, with subsets ranging from small for debugging to very large for industrial-scale experiments.

How It Works

The dataset comprises over 33,000 hours of audio from diverse sources like audiobooks, podcasts, and YouTube, with 10,000 hours meticulously transcribed by human annotators. It provides detailed metadata in a single JSON file, enabling granular control over data selection and segmentation for ASR tasks. The data is resampled to 16 kHz and compressed using Opus, with specific guidelines for audio and text pre-processing to ensure consistency across different ASR toolkits.
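
Since the released audio is Opus-compressed, a typical first pre-processing step is decoding each file back to 16 kHz mono WAV before feeding it to an ASR toolkit. A minimal sketch with ffmpeg, assuming ffmpeg is installed; the input path is a placeholder, not the corpus's actual layout:

    # Decode one Opus file to 16 kHz, single-channel WAV (path is a placeholder).
    ffmpeg -i path/to/AUDIO_ID.opus -ac 1 -ar 16000 path/to/AUDIO_ID.wav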

Quick Start & Requirements

  • Install/Run: Clone the repository and use the provided shell scripts (e.g., utils/download_gigaspeech.sh and toolkits/kaldi/gigaspeech_data_prep.sh); a hedged end-to-end sketch follows this list.
  • Prerequisites: Disk space for dataset storage, jq for metadata processing.
  • Links: HuggingFace, Leaderboard, Paper
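
A hedged sketch of the quick-start flow described above. The repository URL points at the SpeechColab GitHub project, and the --train-subset flag is an assumption about the data-prep script's interface; check the script headers for the exact options:

    # Clone the repository and fetch the corpus (requires ample disk space and jq).
    git clone https://github.com/SpeechColab/GigaSpeech.git
    cd GigaSpeech
    utils/download_gigaspeech.sh /path/to/gigaspeech

    # Prepare Kaldi-format data directories; the subset flag is an assumed option.
    toolkits/kaldi/gigaspeech_data_prep.sh --train-subset XL /path/to/gigaspeech data/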

Highlighted Details

  • Offers 10,000 hours of human-transcribed data and 33,000+ hours for unsupervised/semi-supervised learning.
  • Includes multiple training subsets (XS to XL) and evaluation subsets (Dev, Test).
  • Provides data preparation scripts for popular ASR toolkits like Kaldi and ESPnet.
  • Metadata includes segment-level timestamps and normalized text with punctuation.
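
Because all metadata lives in a single JSON file, segment-level fields can be pulled out with jq. A minimal sketch; the file name GigaSpeech.json and the field names (.audios[].segments[], .sid, .begin_time, .end_time, .text_tn) are assumptions about the schema and should be verified against the downloaded metadata:

    # List segment ID, start/end times, and normalized text as TSV (field names assumed).
    jq -r '.audios[].segments[] | [.sid, .begin_time, .end_time, .text_tn] | @tsv' \
        GigaSpeech.json | head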

Maintenance & Community

The project is a collaborative effort by volunteers. Contributions are welcomed, with ongoing efforts to add speaker information, more data sources, and support for additional tasks. Contact: gigaspeech@speechcolab.org.

Licensing & Compatibility

The dataset is intended for research purposes. Specific licensing details for commercial use are not explicitly stated in the README, but the data is derived from publicly available sources.

Limitations & Caveats

The dataset is an "evolving" corpus, meaning updates may occur. While the README mentions plans for speaker information, it is not yet included in the metadata. Users are advised to resample Opus audio to 16 kHz and handle conversational fillers and garbage tags as per the provided guidelines.
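
On the transcript side, the guidelines call for stripping punctuation tags and discarding segments that contain only garbage tags. A hedged sketch; the tag names (<COMMA>, <PERIOD>, <QUESTIONMARK>, <EXCLAMATIONPOINT>, <SIL>, <NOISE>, <MUSIC>, <OTHER>) and the one-segment-per-line text.raw input are assumptions, so confirm them against the official guidelines:

    # Remove punctuation tags, then drop lines that consist only of garbage tags.
    sed -E 's/<(COMMA|PERIOD|QUESTIONMARK|EXCLAMATIONPOINT)>//g' text.raw \
      | grep -vE '^[[:space:]]*(<(SIL|NOISE|MUSIC|OTHER)>[[:space:]]*)+$' > text.clean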

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 17 stars in the last 90 days

Explore Similar Projects

Starred by Omar Sanseviero (DevRel at Google DeepMind) and Patrick von Platen (Core Contributor to Hugging Face Transformers and Diffusers).

  • voice_datasets by jim-schwoebel: Voice dataset list for voice/sound computing (2k stars, top 0.3%, created 6 years ago, updated 1 year ago)