Large-scale Chinese speech recognition dataset
Top 57.0% on SourcePulse
WenetSpeech provides a large-scale Chinese speech recognition dataset exceeding 10,000 hours, designed for training robust Automatic Speech Recognition (ASR) systems. It caters to researchers and developers working on Chinese ASR, offering diverse data categories and subsets for various training scales.
How It Works
The dataset is compiled from YouTube and podcast sources, with initial labeling performed using Optical Character Recognition (OCR) and ASR techniques. Data quality is enhanced through an end-to-end label error detection method. The dataset is categorized into "High Label" (confidence >= 0.95), "Weak Label" (confidence in [0.6, 0.95)), and "Unlabel" data, enabling supervised, semi-supervised, and unsupervised training approaches.
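The confidence-based split described above can be sketched in a few lines of Python. This is only an illustration, not the project's own tooling: the function name `label_category`, the use of `None` for a segment without a transcript, and the choice to treat confidences below 0.6 as "Unlabel" are all assumptions of this sketch. The shared 0.95 boundary is resolved in favor of "High Label", matching the ">= 0.95" rule.

```python
from typing import Optional

# Thresholds taken from the category description; everything else is hypothetical.
HIGH_THRESHOLD = 0.95  # "High Label": confidence >= 0.95
WEAK_THRESHOLD = 0.6   # "Weak Label": confidence from 0.6 up to (not including) 0.95


def label_category(confidence: Optional[float]) -> str:
    """Map a segment's label confidence to one of the three categories.

    Assumptions of this sketch (not documented dataset behavior):
    - None means the segment has no transcript at all.
    - Confidences below 0.6 are treated like unlabeled audio.
    """
    if confidence is None or confidence < WEAK_THRESHOLD:
        return "Unlabel"
    if confidence >= HIGH_THRESHOLD:
        return "High Label"
    return "Weak Label"


# Group a few made-up confidence values by category.
examples = [0.99, 0.95, 0.80, 0.60, 0.30, None]
grouped = {}
for conf in examples:
    grouped.setdefault(label_category(conf), []).append(conf)
```

A supervised recipe would train only on the "High Label" group, while semi-supervised or unsupervised recipes could also draw on the other two.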
Quick Start & Requirements
Download the dataset with utils/download_wenetspeech.sh. An alternative download method using modelscope is also provided; it requires torch and modelscope to be installed.
Access requires applying for a password via the official website.
Highlighted Details
Maintenance & Community
The project acknowledges contributions and suggestions from GigaSpeech, Tencent Ethereal Audio Lab, Xi'an Future AI Innovation Center, and MindSpore. Communication channels include an official WeChat account and a WeChat group for discussions.
Licensing & Compatibility
The README asks users to read the license before applying for download access, implying a specific usage agreement. Terms for commercial use or closed-source linking are not explicitly detailed and are likely governed by the terms of use accepted during the password application.
Limitations & Caveats
Access to the dataset requires an application and a password, which may delay immediate use. Because the audio comes from YouTube and podcast sources, some segments may contain background noise or vary in audio quality.
2 years ago
Inactive