GigaSpeech by SpeechColab

Large dataset for speech recognition research

created 4 years ago
689 stars

Top 50.3% on sourcepulse

View on GitHub
Project Summary

GigaSpeech is a large-scale, evolving dataset designed for Automatic Speech Recognition (ASR) research, offering over 10,000 hours of high-quality transcribed English audio. It caters to researchers and practitioners needing diverse, real-world speech data for training and evaluating ASR systems, with subsets ranging from small for debugging to very large for industrial-scale experiments.

How It Works

The dataset comprises over 33,000 hours of audio from diverse sources like audiobooks, podcasts, and YouTube, with 10,000 hours meticulously transcribed by human annotators. It provides detailed metadata in a single JSON file, enabling granular control over data selection and segmentation for ASR tasks. The data is resampled to 16 kHz and compressed using Opus, with specific guidelines for audio and text pre-processing to ensure consistency across different ASR toolkits.
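
Since the released audio is Opus-compressed, a typical first pre-processing step is decoding each file back to 16 kHz mono WAV before feeding it to an ASR toolkit. A minimal sketch with ffmpeg, assuming ffmpeg is installed; the input path is a placeholder, not the corpus's actual layout:

    # Decode one Opus file to 16 kHz, single-channel WAV (path is a placeholder).
    ffmpeg -i path/to/AUDIO_ID.opus -ac 1 -ar 16000 path/to/AUDIO_ID.wav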

Quick Start & Requirements

  • Install/Run: Clone the repository and use the provided shell scripts (e.g., utils/download_gigaspeech.sh and toolkits/kaldi/gigaspeech_data_prep.sh); a hedged end-to-end sketch follows this list.
  • Prerequisites: Disk space for dataset storage, jq for metadata processing.
  • Links: HuggingFace, Leaderboard, Paper
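
A hedged sketch of the quick-start flow described above. The repository URL points at the SpeechColab GitHub project, and the --train-subset flag is an assumption about the data-prep script's interface; check the script headers for the exact options:

    # Clone the repository and fetch the corpus (requires ample disk space and jq).
    git clone https://github.com/SpeechColab/GigaSpeech.git
    cd GigaSpeech
    utils/download_gigaspeech.sh /path/to/gigaspeech

    # Prepare Kaldi-format data directories; the subset flag is an assumed option.
    toolkits/kaldi/gigaspeech_data_prep.sh --train-subset XL /path/to/gigaspeech data/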

Highlighted Details

  • Offers 10,000 hours of human-transcribed data and 33,000+ hours for unsupervised/semi-supervised learning.
  • Includes multiple training subsets (XS to XL) and evaluation subsets (Dev, Test).
  • Provides data preparation scripts for popular ASR toolkits like Kaldi and ESPnet.
  • Metadata includes segment-level timestamps and normalized text with punctuation.
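
Because all metadata lives in a single JSON file, segment-level fields can be pulled out with jq. A minimal sketch; the file name GigaSpeech.json and the field names (.audios[].segments[], .sid, .begin_time, .end_time, .text_tn) are assumptions about the schema and should be verified against the downloaded metadata:

    # List segment ID, start/end times, and normalized text as TSV (field names assumed).
    jq -r '.audios[].segments[] | [.sid, .begin_time, .end_time, .text_tn] | @tsv' \
        GigaSpeech.json | head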

Maintenance & Community

The project is a collaborative effort by volunteers. Contributions are welcomed, with ongoing efforts to add speaker information, more data sources, and support for additional tasks. Contact: gigaspeech@speechcolab.org.

Licensing & Compatibility

The dataset is intended for research purposes. Specific licensing details for commercial use are not explicitly stated in the README, but the data is derived from publicly available sources.

Limitations & Caveats

The dataset is an "evolving" corpus, meaning updates may occur. While the README mentions plans for speaker information, it is not yet included in the metadata. Users are advised to resample Opus audio to 16 kHz and handle conversational fillers and garbage tags as per the provided guidelines.
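
On the transcript side, the guidelines call for stripping punctuation tags and discarding segments that contain only garbage tags. A hedged sketch; the tag names (<COMMA>, <PERIOD>, <QUESTIONMARK>, <EXCLAMATIONPOINT>, <SIL>, <NOISE>, <MUSIC>, <OTHER>) and the one-segment-per-line text.raw input are assumptions, so confirm them against the official guidelines:

    # Remove punctuation tags, then drop lines that consist only of garbage tags.
    sed -E 's/<(COMMA|PERIOD|QUESTIONMARK|EXCLAMATIONPOINT)>//g' text.raw \
      | grep -vE '^[[:space:]]*(<(SIL|NOISE|MUSIC|OTHER)>[[:space:]]*)+$' > text.clean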

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 17 stars in the last 90 days

Explore Similar Projects

Starred by Omar Sanseviero (DevRel at Google DeepMind) and Patrick von Platen (Core Contributor to Hugging Face Transformers and Diffusers).

  • voice_datasets by jim-schwoebel: Voice dataset list for voice/sound computing (2k stars, top 0.3%, created 6 years ago, updated 1 year ago)