speech_dataset by double22a

Speech datasets for recognition and synthesis

Created 5 years ago

463 stars

Top 64.7% on SourcePulse

Project Summary

This repository is a curated list of speech datasets, primarily for Chinese, English, Japanese, Korean, Russian, French, Spanish, and Turkish. It groups datasets by application: speech recognition, speech synthesis, speaker recognition, speaker diarization, and voice activity detection. It gives researchers and developers a single, organized reference when looking for speech data for these tasks.

How It Works

The repository presents a comprehensive table of datasets, detailing their names, durations in hours, download addresses (primarily OpenSLR, Hugging Face, and specific project pages), and remarks on their content or application. It is structured to allow users to quickly find relevant datasets based on language, task, and data size.
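The lookup this table supports (filter by language, task, and size) can be sketched in a few lines of Python. This is only an illustration: the `DatasetEntry` class and `find_datasets` function are hypothetical names, the repository ships no code, the task labels on the sample rows are assumptions, and download addresses are left out rather than guessed.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DatasetEntry:
    """One row of the README table (hypothetical structure)."""
    name: str
    language: str
    task: str
    hours: Optional[float]  # None when the README lists no duration
    remark: str = ""


# Illustrative rows only: names and hours are taken from this summary,
# task labels are assumptions, and download addresses are omitted.
ENTRIES = [
    DatasetEntry("WenetSpeech", "Chinese", "recognition", 10000),
    DatasetEntry("aidatatang_1505zh", "Chinese", "recognition", 1505),
    DatasetEntry("The People's Speech", "English", "recognition", 31400),
    DatasetEntry("Opencpop", "Chinese", "synthesis", None, "singing voice synthesis"),
]


def find_datasets(
    entries: List[DatasetEntry],
    language: Optional[str] = None,
    task: Optional[str] = None,
    min_hours: float = 0.0,
) -> List[DatasetEntry]:
    """Return entries matching every given filter, largest first."""
    matches = []
    for entry in entries:
        if language is not None and entry.language != language:
            continue
        if task is not None and entry.task != task:
            continue
        # Entries without a known duration cannot satisfy a size threshold.
        if min_hours > 0 and (entry.hours is None or entry.hours < min_hours):
            continue
        matches.append(entry)
    # Entries with unknown duration sort last.
    return sorted(matches, key=lambda e: e.hours or 0.0, reverse=True)


for entry in find_datasets(ENTRIES, language="Chinese", task="recognition", min_hours=1000):
    print(entry.name, entry.hours)
```

Keeping `hours` optional reflects that some entries in the list have no duration; such rows still match language and task filters but are skipped once a minimum size is requested.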

Quick Start & Requirements

No installation or specific requirements are mentioned, as this is a reference list. Users are directed to the provided URLs to access and download the datasets.

Highlighted Details

Extensive coverage of Chinese speech datasets, including large-scale corpora like WenetSpeech (10000h) and aidatatang_1505zh (1505h).
Multilingual support with significant datasets for English (e.g., The People's Speech, 31400h; VoxPopuli, 24100h+543h) and other languages.
Datasets are categorized by specific speech tasks such as recognition, synthesis, diarization, and voice activity detection.
Includes datasets for specialized tasks like singing voice synthesis (Opencpop) and Mandarin heavy accent speech.
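Durations in the list are written as strings such as `10000h` or `24100h+543h`, so comparing sizes programmatically needs a small parsing step. The sketch below is one possible approach, not something the repository provides; `parse_hours` is a hypothetical helper, and adding up the `+`-separated parts is an assumption about how a composite value should be read.

```python
import re
from typing import Optional

# Matches one "<number>h" component, e.g. "10000h" or "543 h".
_HOURS = re.compile(r"(\d+(?:\.\d+)?)\s*h", re.IGNORECASE)


def parse_hours(text: str) -> Optional[float]:
    """Sum every '<number>h' component; return None if none is found."""
    parts = _HOURS.findall(text.replace(",", ""))
    if not parts:
        return None  # duration missing or written in another form
    return sum(float(p) for p in parts)


print(parse_hours("10000h"))       # 10000.0
print(parse_hours("24100h+543h"))  # 24643.0
print(parse_hours(""))             # None
```

Summing may not always be what is wanted: if the parts of a composite duration describe different subsets of a corpus, keeping the list of parts instead of their total would preserve that distinction.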

Maintenance & Community

Information regarding maintainers, community channels, or specific update frequency is not provided in the README.

Licensing & Compatibility

Dataset licenses vary and are not stated in the README; users must follow each dataset's link for its licensing terms. Whether commercial use is permitted depends on the license of the individual dataset.

Limitations & Caveats

The README provides no download scripts; its links point to external sites, where users must locate and fetch the data themselves. Some entries lack duration information or remarks, and datasets marked "if available" may not be obtainable. The "Free ST Chinese Mandarin Corpus" is listed under English datasets, which may be a categorization error.
