Large dataset for speech recognition research
GigaSpeech is a large-scale, evolving dataset designed for Automatic Speech Recognition (ASR) research, offering over 10,000 hours of high-quality transcribed English audio. It caters to researchers and practitioners needing diverse, real-world speech data for training and evaluating ASR systems, with subsets ranging from small for debugging to very large for industrial-scale experiments.
How It Works
The dataset comprises over 33,000 hours of audio from diverse sources like audiobooks, podcasts, and YouTube, with 10,000 hours meticulously transcribed by human annotators. It provides detailed metadata in a single JSON file, enabling granular control over data selection and segmentation for ASR tasks. The data is resampled to 16 kHz and compressed using Opus, with specific guidelines for audio and text pre-processing to ensure consistency across different ASR toolkits.
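For orientation, below is a minimal sketch of selecting one subset from that metadata file. The file name GigaSpeech.json and the field names used (audios, segments, subsets, sid, begin_time, end_time, text_tn) are assumptions about the JSON layout and should be checked against the metadata version actually downloaded.

```python
import json

# Load the single metadata file that ships with the corpus.
# NOTE: the field names below are assumptions about the JSON layout;
# verify them against the GigaSpeech.json you actually download.
with open("GigaSpeech.json", encoding="utf-8") as f:
    meta = json.load(f)

# Collect the segments that belong to the small "XS" debugging subset.
xs_segments = []
for audio in meta["audios"]:
    for seg in audio.get("segments", []):
        if any("XS" in s for s in seg.get("subsets", [])):
            xs_segments.append(
                {
                    "sid": seg["sid"],
                    "audio": audio["path"],
                    "begin": seg["begin_time"],
                    "end": seg["end_time"],
                    "text": seg["text_tn"],
                }
            )

total_hours = sum(s["end"] - s["begin"] for s in xs_segments) / 3600.0
print(f"{len(xs_segments)} segments, {total_hours:.1f} hours in XS")
```

The same loop can be pointed at a larger subset label once the small one works, which is the intended use of the graded subsets for debugging versus full-scale training.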
Quick Start & Requirements
Download and prepare the data with the provided scripts (utils/download_gigaspeech.sh and toolkits/kaldi/gigaspeech_data_prep.sh). The preparation step requires jq for metadata processing.
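The toolkit scripts handle audio conversion for their respective recipes; if you work outside them, a minimal sketch of decoding the Opus files to 16 kHz mono WAV is shown below, assuming ffmpeg is installed and on PATH (the directory names are illustrative).

```python
import subprocess
from pathlib import Path

# Decode every Opus file under audio/ to 16 kHz mono WAV.
# Assumes ffmpeg is on PATH; directory names are illustrative.
src_dir = Path("audio")
dst_dir = Path("wav16k")
dst_dir.mkdir(exist_ok=True)

for opus in src_dir.rglob("*.opus"):
    wav = dst_dir / (opus.stem + ".wav")
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(opus), "-ar", "16000", "-ac", "1", str(wav)],
        check=True,
    )
```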
Highlighted Details
Maintenance & Community
The project is a collaborative effort by volunteers. Contributions are welcomed, with ongoing efforts to add speaker information, more data sources, and support for additional tasks. Contact: gigaspeech@speechcolab.org.
Licensing & Compatibility
The dataset is intended for research purposes. Specific licensing details for commercial use are not explicitly stated in the README, but the data is derived from publicly available sources.
Limitations & Caveats
The dataset is an "evolving" corpus, meaning updates may occur. While the README mentions plans for speaker information, it is not yet included in the metadata. Users are advised to resample Opus audio to 16 kHz and handle conversational fillers and garbage tags as per the provided guidelines.
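A minimal sketch of that text post-processing follows; the tag inventory shown (garbage tags such as <SIL>, <NOISE>, <MUSIC>, <OTHER>, punctuation tags such as <COMMA>, <PERIOD>, <QUESTIONMARK>, <EXCLAMATIONPOINT>, and the filler words) is illustrative, so consult the official guidelines for the authoritative lists.

```python
# Illustrative tag and filler sets; the authoritative lists are in the
# GigaSpeech text pre/post-processing guidelines.
GARBAGE_TAGS = {"<SIL>", "<NOISE>", "<MUSIC>", "<OTHER>"}
PUNCT_TAGS = {"<COMMA>", "<PERIOD>", "<QUESTIONMARK>", "<EXCLAMATIONPOINT>"}
FILLERS = {"UH", "UM", "AH", "ER"}  # example conversational fillers

def clean_transcript(text: str) -> str:
    """Drop garbage/punctuation tags and conversational fillers before scoring."""
    kept = [
        tok
        for tok in text.split()
        if tok not in GARBAGE_TAGS and tok not in PUNCT_TAGS and tok not in FILLERS
    ]
    return " ".join(kept)

print(clean_transcript("<NOISE> UM I THINK <COMMA> THAT IS FINE <PERIOD>"))
# -> "I THINK THAT IS FINE"
```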