WenetSpeech-Yue by ASLP-lab

Large-scale Cantonese speech dataset and processing pipeline

Created 4 months ago
254 stars

Top 99.1% on SourcePulse

View on GitHub
Project Summary

WenetSpeech-Yue provides a large-scale, multi-dimensionally annotated Cantonese speech corpus (21,800 hours) and the associated WenetSpeech-Pipe data processing pipeline. It addresses the scarcity of high-quality Cantonese speech resources for Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) research. The project benefits researchers and developers by offering a comprehensive dataset with rich annotations, including speaker attributes, speech quality metrics, and character-level timestamps, alongside evaluation benchmarks.

How It Works

The WenetSpeech-Pipe pipeline systematically processes raw audio data by segmenting long recordings, annotating speaker attributes (age, gender) using tools like pyannote and Vox-Profile, and assessing speech quality via SNR and DNSMOS scores. ASR transcriptions are generated using multiple models (SenseVoice, TeleASR, Whisper), merged through a ROVER-like voting mechanism, and further refined by an LLM for context-aware corrections. This multi-stage approach yields high-confidence transcripts and rich metadata for downstream tasks.
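
As a toy illustration of the voting step, the sketch below takes already-aligned character hypotheses from several ASR systems and keeps the majority token at each position. This is only a simplification under stated assumptions: the actual WenetSpeech-Pipe combiner aligns hypotheses before voting and tracks per-character confidence, and the function name and example data here are hypothetical.

```python
from collections import Counter
from itertools import zip_longest

def rover_like_vote(hypotheses):
    """Toy position-wise majority vote over aligned ASR hypotheses.

    A real ROVER-style combiner first aligns the hypotheses (e.g. with
    dynamic programming); here the token sequences are assumed to be
    pre-aligned, purely to illustrate the voting step.
    """
    merged = []
    for column in zip_longest(*hypotheses, fillvalue=None):
        votes = Counter(tok for tok in column if tok is not None)
        if votes:
            merged.append(votes.most_common(1)[0][0])
    return merged

# Three hypothetical hypotheses for the same Cantonese segment.
hyps = [
    list("今日天氣好"),
    list("今日天氣好"),
    list("今日添氣好"),
]
print("".join(rover_like_vote(hyps)))  # -> 今日天氣好
```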

Quick Start & Requirements

To set up, clone the repository and create a Conda environment with Python 3.10 and pynini==2.1.5, then install the remaining dependencies with pip install -r requirements.txt. ASR inference is run from the command line with the released model checkpoints and configuration files. TTS inference involves downloading models from Hugging Face (ASLP-lab/WSYue-TTS) and running the provided Python scripts, which depend on packages such as funasr and torchaudio. GPU acceleration is recommended for inference.
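
The snippet below is a minimal sketch of fetching the TTS checkpoints programmatically; the Hugging Face repo id comes from the README, while the local target directory is a hypothetical choice, and inference itself is run through the Python scripts shipped with the repository.

```python
# Minimal sketch: download the WSYue-TTS checkpoints from Hugging Face.
# The repo id is taken from the README; the local directory is only a
# hypothetical choice for illustration.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="ASLP-lab/WSYue-TTS",
    local_dir="pretrained/WSYue-TTS",  # hypothetical target path
)
print(f"Checkpoints downloaded to {local_dir}")
```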

Highlighted Details

  • Dataset Size: 21,800 hours of Cantonese speech across ten domains (Storytelling, Entertainment, Vlog, etc.).
  • Annotations: Includes audio path, duration, text confidence, speaker ID, SNR, DNSMOS, age, gender, and character-level timestamps (an illustrative record sketch follows this list).
  • Benchmarks: WSYue-eval provides comprehensive benchmarks for ASR and zero-shot Cantonese TTS, including diverse real-world scenarios and linguistic phenomena.
  • ASR Leaderboard: Features performance metrics for various ASR models on WSYue-eval, including Conformer-Yue, Paraformer, and Whisper variants.
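
To make the annotation schema concrete, the record below sketches what a single utterance entry might look like. The key names and all values are hypothetical, inferred from the fields listed above; the released metadata may use a different layout.

```python
# Hypothetical shape of one utterance record, inferred from the annotation
# fields listed above; actual key names, units, and file layout may differ.
example_record = {
    "audio_path": "audio/storytelling/segment_0001.wav",  # hypothetical path
    "duration": 6.42,          # seconds
    "text": "今日天氣好",       # made-up Cantonese transcript
    "confidence": 0.93,        # text confidence from the voting stage
    "speaker_id": "spk_00123",
    "snr": 21.7,               # signal-to-noise ratio in dB
    "dnsmos": 3.4,             # DNSMOS perceptual quality estimate
    "age": "adult",
    "gender": "female",
    "timestamps": [            # character-level [char, start, end] in seconds
        ["今", 0.00, 0.18],
        ["日", 0.18, 0.35],
    ],
}
```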

Maintenance & Community

The project is associated with the ASLP@NPU group. Contact emails (lhli@mail.nwpu.edu.cn, gzhao@mail.nwpu.edu.cn) are provided for inquiries. A WeChat discussion group is available via a QR code scan. No specific details on active contributors, sponsorships, or a public roadmap are present in the provided text.

Licensing & Compatibility

The README does not explicitly state the license under which the dataset or code is released. This omission is a significant adoption blocker, especially for commercial use or integration into closed-source projects.

Limitations & Caveats

The README does not detail specific limitations, known bugs, or unsupported platforms. While the dataset is described as large-scale and multi-dimensional, possible gaps in domain coverage or under-represented linguistic phenomena are not discussed. The absence of explicit licensing information is a critical caveat for potential users.

Health Check

  • Last Commit: 1 month ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1

Star History

14 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Pietro Schirano (Founder of MagicPath), and 2 more.

metavoice-src by metavoiceio

  • 0.0% · 4k stars
  • TTS model for human-like, expressive speech
  • Created 1 year ago · Updated 1 year ago
  • Starred by Georgios Konstantopoulos (CTO, General Partner at Paradigm) and Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems").

GPT-SoVITS by RVC-Boss

  • 0.4% · 54k stars
  • Few-shot voice cloning and TTS web UI
  • Created 2 years ago · Updated 1 week ago