WenetSpeech-Yue by ASLP-lab

Large-scale Cantonese speech dataset and processing pipeline

Created 4 months ago
254 stars

Top 99.1% on SourcePulse

View on GitHub
Project Summary

WenetSpeech-Yue provides a large-scale, multi-dimensionally annotated Cantonese speech corpus (21,800 hours) and the associated WenetSpeech-Pipe data processing pipeline. It addresses the scarcity of high-quality Cantonese speech resources for Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) research. The project benefits researchers and developers by offering a comprehensive dataset with rich annotations, including speaker attributes, speech quality metrics, and character-level timestamps, alongside evaluation benchmarks.

How It Works

The WenetSpeech-Pipe pipeline systematically processes raw audio data by segmenting long recordings, annotating speaker attributes (age, gender) using tools like pyannote and Vox-Profile, and assessing speech quality via SNR and DNSMOS scores. ASR transcriptions are generated using multiple models (SenseVoice, TeleASR, Whisper), merged through a ROVER-like voting mechanism, and further refined by an LLM for context-aware corrections. This multi-stage approach yields high-confidence transcripts and rich metadata for downstream tasks.
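
As a toy illustration of the voting step, the sketch below takes already-aligned character hypotheses from several ASR systems and keeps the majority token at each position. This is only a simplification under stated assumptions: the actual WenetSpeech-Pipe combiner aligns hypotheses before voting and tracks per-character confidence, and the function name and example data here are hypothetical.

```python
from collections import Counter
from itertools import zip_longest

def rover_like_vote(hypotheses):
    """Toy position-wise majority vote over aligned ASR hypotheses.

    A real ROVER-style combiner first aligns the hypotheses (e.g. with
    dynamic programming); here the token sequences are assumed to be
    pre-aligned, purely to illustrate the voting step.
    """
    merged = []
    for column in zip_longest(*hypotheses, fillvalue=None):
        votes = Counter(tok for tok in column if tok is not None)
        if votes:
            merged.append(votes.most_common(1)[0][0])
    return merged

# Three hypothetical hypotheses for the same Cantonese segment.
hyps = [
    list("今日天氣好"),
    list("今日天氣好"),
    list("今日添氣好"),
]
print("".join(rover_like_vote(hyps)))  # -> 今日天氣好
```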

Quick Start & Requirements

To set up, clone the repository and create a Conda environment with Python 3.10 and pynini==2.1.5, then install the remaining dependencies with pip install -r requirements.txt. ASR inference is run from the command line with the released model checkpoints and configuration files. TTS inference involves downloading models from Hugging Face (ASLP-lab/WSYue-TTS) and running the provided Python scripts, which depend on packages such as funasr and torchaudio. GPU acceleration is recommended for inference.
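
The snippet below is a minimal sketch of fetching the TTS checkpoints programmatically; the Hugging Face repo id comes from the README, while the local target directory is a hypothetical choice, and inference itself is run through the Python scripts shipped with the repository.

```python
# Minimal sketch: download the WSYue-TTS checkpoints from Hugging Face.
# The repo id is taken from the README; the local directory is only a
# hypothetical choice for illustration.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="ASLP-lab/WSYue-TTS",
    local_dir="pretrained/WSYue-TTS",  # hypothetical target path
)
print(f"Checkpoints downloaded to {local_dir}")
```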

Highlighted Details

  • Dataset Size: 21,800 hours of Cantonese speech across ten domains (Storytelling, Entertainment, Vlog, etc.).
  • Annotations: Includes audio path, duration, text confidence, speaker ID, SNR, DNSMOS, age, gender, and character-level timestamps (an illustrative record sketch follows this list).
  • Benchmarks: WSYue-eval provides comprehensive benchmarks for ASR and zero-shot Cantonese TTS, including diverse real-world scenarios and linguistic phenomena.
  • ASR Leaderboard: Features performance metrics for various ASR models on WSYue-eval, including Conformer-Yue, Paraformer, and Whisper variants.
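
To make the annotation schema concrete, the record below sketches what a single utterance entry might look like. The key names and all values are hypothetical, inferred from the fields listed above; the released metadata may use a different layout.

```python
# Hypothetical shape of one utterance record, inferred from the annotation
# fields listed above; actual key names, units, and file layout may differ.
example_record = {
    "audio_path": "audio/storytelling/segment_0001.wav",  # hypothetical path
    "duration": 6.42,          # seconds
    "text": "今日天氣好",       # made-up Cantonese transcript
    "confidence": 0.93,        # text confidence from the voting stage
    "speaker_id": "spk_00123",
    "snr": 21.7,               # signal-to-noise ratio in dB
    "dnsmos": 3.4,             # DNSMOS perceptual quality estimate
    "age": "adult",
    "gender": "female",
    "timestamps": [            # character-level [char, start, end] in seconds
        ["今", 0.00, 0.18],
        ["日", 0.18, 0.35],
    ],
}
```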

Maintenance & Community

The project is associated with the ASLP@NPU group. Contact emails (lhli@mail.nwpu.edu.cn, gzhao@mail.nwpu.edu.cn) are provided for inquiries. A WeChat discussion group is available via a QR code scan. No specific details on active contributors, sponsorships, or a public roadmap are present in the provided text.

Licensing & Compatibility

The README does not explicitly state the license under which the dataset or code is released. This omission is a significant adoption blocker, especially for commercial use or integration into closed-source projects.

Limitations & Caveats

The README does not detail specific limitations, known bugs, or unsupported platforms. While the dataset is described as large-scale and multi-dimensional, possible gaps in domain coverage or under-represented linguistic phenomena are not discussed. The absence of explicit licensing information is a critical caveat for potential users.

Health Check

  • Last Commit: 1 month ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1

Star History

14 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Pietro Schirano (Founder of MagicPath), and 2 more.

metavoice-src by metavoiceio

  • 0.0% · 4k stars
  • TTS model for human-like, expressive speech
  • Created 1 year ago · Updated 1 year ago
  • Starred by Georgios Konstantopoulos (CTO, General Partner at Paradigm) and Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems").

GPT-SoVITS by RVC-Boss

  • 0.4% · 54k stars
  • Few-shot voice cloning and TTS web UI
  • Created 2 years ago · Updated 1 week ago