speech_dataset  by double22a

Speech datasets for recognition and synthesis

Created 4 years ago
424 stars

Top 69.5% on SourcePulse

Project Summary

This repository is a curated list of speech datasets, primarily in Chinese, English, Japanese, Korean, Russian, French, Spanish, and Turkish. It groups datasets by application: speech recognition, speech synthesis, speaker recognition, speaker diarization, and voice activity detection. Its main value is as a centralized, organized reference for researchers and developers looking for speech data.

How It Works

The repository presents a comprehensive table of datasets, detailing their names, durations in hours, download addresses (primarily OpenSLR, Hugging Face, and specific project pages), and remarks on their content or application. It is structured to allow users to quickly find relevant datasets based on language, task, and data size.
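The lookup described above (by data size, for instance) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the column headers, the "(link)" placeholders, and the last row are hypothetical and not copied from the README, while the durations for WenetSpeech, aidatatang_1505zh, and VoxPopuli are the figures quoted in this summary.

```python
import re
from typing import Dict, List, Optional

# Hypothetical excerpt: the column headers, the "(link)" placeholders, and the
# last row are assumptions. The other durations are quoted from this summary.
SAMPLE_TABLE = """
| Name | Duration | Address | Remark |
| --- | --- | --- | --- |
| WenetSpeech | 10000h | (link) | Chinese |
| aidatatang_1505zh | 1505h | (link) | Chinese |
| VoxPopuli | 24100h+543h | (link) | English |
| example_without_duration | | (link) | hypothetical entry with no duration |
"""


def parse_hours(cell: str) -> Optional[float]:
    """Sum every number in a duration cell such as '24100h+543h'; None if empty."""
    numbers = re.findall(r"\d+(?:\.\d+)?", cell)
    return sum(float(n) for n in numbers) if numbers else None


def parse_table(markdown: str) -> List[Dict[str, object]]:
    """Turn a pipe-delimited table into a list of row dictionaries."""
    lines = [ln.strip() for ln in markdown.splitlines() if ln.strip().startswith("|")]
    header = [c.strip() for c in lines[0][1:-1].split("|")]
    rows: List[Dict[str, object]] = []
    for ln in lines[2:]:  # skip the header row and the '---' separator row
        cells = [c.strip() for c in ln[1:-1].split("|")]
        row: Dict[str, object] = dict(zip(header, cells))
        row["Hours"] = parse_hours(str(row.get("Duration", "")))
        rows.append(row)
    return rows


def at_least(rows: List[Dict[str, object]], min_hours: float) -> List[str]:
    """Names of datasets with a known duration of at least min_hours."""
    return [
        str(r["Name"])
        for r in rows
        if r["Hours"] is not None and r["Hours"] >= min_hours
    ]


rows = parse_table(SAMPLE_TABLE)
print(at_least(rows, 5000))  # ['WenetSpeech', 'VoxPopuli']
```

Entries with an empty duration cell get `Hours = None` and are skipped by the size filter instead of being treated as zero hours.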

Quick Start & Requirements

No installation or specific requirements are mentioned, as this is a reference list. Users are directed to the provided URLs to access and download the datasets.

Highlighted Details

  • Extensive coverage of Chinese speech datasets, including large-scale corpora like WenetSpeech (10000h) and aidatatang_1505zh (1505h).
  • Multilingual support with significant datasets for English (e.g., The People's Speech, 31400h; VoxPopuli, 24100h+543h) and other languages.
  • Datasets are categorized by specific speech tasks such as recognition, synthesis, diarization, and voice activity detection.
  • Includes datasets for specialized tasks such as singing voice synthesis (Opencpop) and heavily accented Mandarin speech.

Maintenance & Community

Information regarding maintainers, community channels, or specific update frequency is not provided in the README.

Licensing & Compatibility

Dataset licenses vary and are not explicitly stated here; users must refer to the individual dataset links for licensing details. Compatibility for commercial use depends on each dataset's specific license.

Limitations & Caveats

The README offers no direct download links or scripts; each entry points to an external site that users must navigate themselves. Some entries lack duration information or remarks, and datasets marked "if available" are not guaranteed to be accessible. The "Free ST Chinese Mandarin Corpus" is listed under English datasets, which may be a categorization error.

Health Check

  • Last commit: 8 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 7 stars in the last 30 days
