WenetSpeech by wenet-e2e

Large-scale Chinese speech recognition dataset

Created 4 years ago
564 stars

Top 57.0% on SourcePulse

Project Summary

WenetSpeech provides a large-scale Chinese speech recognition dataset exceeding 10,000 hours, designed for training robust Automatic Speech Recognition (ASR) systems. It caters to researchers and developers working on Chinese ASR, offering diverse data categories and subsets for various training scales.

How It Works

The dataset is compiled from YouTube and podcast sources. Initial labels are produced with Optical Character Recognition (OCR) and ASR, and an end-to-end label error detection method then improves label quality. The data falls into three categories: "High Label" (confidence >= 0.95), "Weak Label" (confidence in [0.6, 0.95]), and "Unlabel". These support supervised, semi-supervised, and unsupervised training, respectively.
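The confidence thresholds above can be sketched as a small classifier. This is only an illustration, not the project's own tooling: the function names and the (segment_id, confidence) input shape are hypothetical, a score of exactly 0.95 is assigned to "High Label", and treating a missing or below-0.6 score as "Unlabel" is an assumption.

```python
from typing import Dict, Iterable, List, Optional, Tuple

HIGH_THRESHOLD = 0.95  # "High Label" lower bound, as stated in the summary
WEAK_THRESHOLD = 0.6   # "Weak Label" lower bound, as stated in the summary


def label_category(confidence: Optional[float]) -> str:
    """Map a confidence score to one of the three category names."""
    if confidence is None or confidence < WEAK_THRESHOLD:
        # Assumption: missing or low scores are treated as unlabeled audio.
        return "Unlabel"
    if confidence >= HIGH_THRESHOLD:
        return "High Label"
    return "Weak Label"


def group_by_category(
    segments: Iterable[Tuple[str, Optional[float]]]
) -> Dict[str, List[str]]:
    """Group hypothetical (segment_id, confidence) pairs by category."""
    groups: Dict[str, List[str]] = {
        "High Label": [],
        "Weak Label": [],
        "Unlabel": [],
    }
    for segment_id, confidence in segments:
        groups[label_category(confidence)].append(segment_id)
    return groups


if __name__ == "__main__":
    demo = [("seg_a", 0.99), ("seg_b", 0.70), ("seg_c", None)]
    print(group_by_category(demo))
```

A supervised recipe would then read only the "High Label" group, while a semi-supervised one could add the "Weak Label" group.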

Quick Start & Requirements

  • Installation: Download with utils/download_wenetspeech.sh. An alternative download method through ModelScope is also provided.
  • Prerequisites: Python 3.7, torch, and modelscope (ModelScope method only). Downloading requires a password, obtained by applying through the official website.
  • Links: Official website for download and license information.

Highlighted Details

  • Offers 22,435 total hours of Chinese speech data, including 10,005 hours of high-quality labeled data.
  • High-quality data is classified into 10 domains: audiobook, commentary, documentary, drama, interview, news, reading, talk, variety, and others.
  • Provides three training subsets (S, M, L) for ASR systems of varying data scales.
  • Includes evaluation sets (DEV, TEST_NET, TEST_MEETING) for comprehensive ASR system evaluation.
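As a quick worked check of the hour counts listed above, the sketch below derives how much of the corpus is not high-quality labeled data. The two constants come from the list; the function name is hypothetical, and the split of the remainder between "Weak Label" and "Unlabel" is not given here, so it is left combined.

```python
TOTAL_HOURS = 22_435       # total hours, from the list above
HIGH_LABEL_HOURS = 10_005  # high-quality labeled hours, from the list above


def hours_breakdown(total: int, high: int) -> tuple:
    """Return (remaining hours, high-label share of the total)."""
    remaining = total - high
    high_share = high / total
    return remaining, high_share


remaining_hours, high_share = hours_breakdown(TOTAL_HOURS, HIGH_LABEL_HOURS)
print(f"Weak Label + Unlabel: {remaining_hours} hours")  # 12430 hours
print(f"High Label share: {high_share:.1%}")             # 44.6%
```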

Maintenance & Community

The project acknowledges contributions and suggestions from GigaSpeech, Tencent Ethereal Audio Lab, Xi'an Future AI Innovation Center, and MindSpore. Communication channels include an official WeChat account and a WeChat group for discussions.

Licensing & Compatibility

The README directs users to read the license before applying for download, which implies a specific usage agreement. Terms for commercial use or closed-source linking are not explicitly detailed; they are likely governed by the terms of use accepted during the password application.

Limitations & Caveats

Access to the dataset requires an application and a password, which can delay first use. Because the audio comes from YouTube and podcast sources, some segments may contain background noise or vary in audio quality.

Health Check

  • Last commit: 2 years ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 1
  • Star history: 6 stars in the last 30 days
