WenetSpeech by wenet-e2e

Large-scale Chinese speech recognition dataset

Created 5 years ago

621 stars

Top 52.4% on SourcePulse

Project Summary

WenetSpeech is a large-scale Chinese speech recognition dataset with more than 10,000 hours of high-quality labeled speech, designed for training robust Automatic Speech Recognition (ASR) systems. It targets researchers and developers working on Chinese ASR, and offers multiple data categories and training subsets for different data scales.

How It Works

The dataset is compiled from YouTube and podcast sources. Initial labels are produced with Optical Character Recognition (OCR) for the YouTube data and ASR for the podcast data, and an end-to-end label error detection method then validates and filters them. The data is split into three categories by label confidence: "High Label" (confidence >= 0.95) for supervised training, "Weak Label" (confidence in [0.6, 0.95]) for semi-supervised training, and "Unlabel" (unlabeled audio) for unsupervised training.
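The confidence thresholds above can be sketched as a small bucketing function. This is a minimal illustration, not the project's own tooling: the `label_tier` and `group_by_tier` helpers and the `confidence` field name are hypothetical, and it assumes that a segment with no confidence or a confidence below 0.6 counts as "Unlabel". Because the two stated ranges both include 0.95, the sketch assigns exactly 0.95 to "High Label".

```python
from typing import Dict, List, Optional

# Thresholds taken from the category definitions in the text.
HIGH_THRESHOLD = 0.95
WEAK_THRESHOLD = 0.6


def label_tier(confidence: Optional[float]) -> str:
    """Map a label confidence to a category name.

    Assumption: a missing confidence, or one below 0.6, is treated as "Unlabel".
    """
    if confidence is None or confidence < WEAK_THRESHOLD:
        return "Unlabel"
    if confidence >= HIGH_THRESHOLD:
        return "High Label"
    return "Weak Label"


def group_by_tier(segments: List[dict]) -> Dict[str, List[dict]]:
    """Group hypothetical segment records by their category."""
    tiers: Dict[str, List[dict]] = {"High Label": [], "Weak Label": [], "Unlabel": []}
    for seg in segments:
        tiers[label_tier(seg.get("confidence"))].append(seg)
    return tiers
```

A training pipeline could then feed the "High Label" group to supervised training and keep the other two groups for semi-supervised or unsupervised stages.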

Quick Start & Requirements

Installation: Download the dataset with utils/download_wenetspeech.sh. An alternative download method through ModelScope is also provided; it requires the torch and modelscope packages.
Prerequisites: Python 3.7 (for the ModelScope method), torch, modelscope. Downloading requires a password, which must be applied for via the official website.
Links: Official website for download and license information.

Highlighted Details

Offers 22,435 hours of Chinese speech data in total, including 10,005 hours of high-quality labeled data.
High-quality data is classified into 10 domains: audiobook, commentary, documentary, drama, interview, news, reading, talk, variety, and others.
Provides three training subsets (S, M, L) for ASR systems of varying data scales.
Includes evaluation sets (DEV, TEST_NET, TEST_MEETING) for comprehensive ASR system evaluation.
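To check how much audio falls into each training or evaluation subset, segment durations can be summed per subset. The sketch below assumes a hypothetical record layout, with `begin_time` and `end_time` in seconds and a `subsets` list of names such as "S", "M", "L", or "DEV"; the real metadata schema may differ, so the field names should be adapted to the downloaded files. It also assumes that one segment may belong to several subsets.

```python
from collections import defaultdict
from typing import Dict, List


def hours_per_subset(segments: List[dict]) -> Dict[str, float]:
    """Sum segment durations (seconds) per subset name and return hours."""
    totals = defaultdict(float)
    for seg in segments:
        duration = seg["end_time"] - seg["begin_time"]
        # A segment counts toward every subset that lists it.
        for name in seg["subsets"]:
            totals[name] += duration
    return {name: seconds / 3600.0 for name, seconds in totals.items()}


# Hypothetical records, for illustration only.
example = [
    {"begin_time": 0.0, "end_time": 1800.0, "subsets": ["S", "M", "L"]},
    {"begin_time": 0.0, "end_time": 3600.0, "subsets": ["M", "L"]},
    {"begin_time": 10.0, "end_time": 1810.0, "subsets": ["DEV"]},
]

if __name__ == "__main__":
    for name, hours in sorted(hours_per_subset(example).items()):
        print(f"{name}: {hours:.2f} h")
```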

Maintenance & Community

The project acknowledges contributions and suggestions from GigaSpeech, Tencent Ethereal Audio Lab, Xi'an Future AI Innovation Center, and MindSpore. Communication channels include an official WeChat account and a WeChat group for discussions.

Licensing & Compatibility

The README asks users to read the license before applying for download, which implies a specific usage agreement. Terms for commercial use or closed-source linking are not spelled out in the README; they are likely governed by the terms of use accepted when applying for the password.

Limitations & Caveats

Access to the dataset requires an application and a password, which can delay immediate use. Because the audio comes from YouTube and podcast sources, some segments may contain background noise or vary in audio quality.

Health Check

Last Commit: 6 months ago
Responsiveness: Inactive

Star History

8 stars in the last 30 days