ASLP-lab: Large-scale Cantonese speech dataset and processing pipeline
WenetSpeech-Yue provides a large-scale, multi-dimensionally annotated Cantonese speech corpus (21,800 hours) and the associated WenetSpeech-Pipe data processing pipeline. It addresses the scarcity of high-quality Cantonese speech resources for Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) research. The project benefits researchers and developers by offering a comprehensive dataset with rich annotations, including speaker attributes, speech quality metrics, and character-level timestamps, alongside evaluation benchmarks.
How It Works
The WenetSpeech-Pipe pipeline processes raw audio in stages. It segments long recordings, annotates speaker attributes (age, gender) using tools like pyannote and Vox-Profile, and assesses speech quality via SNR and MOS scores. ASR transcriptions are generated by multiple models (SenseVoice, TeleASR, Whisper), merged through a ROVER-like voting mechanism, and then corrected by an LLM using context. The multi-stage design is intended to produce accurate transcripts with rich metadata for downstream tasks.
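The voting step can be illustrated with a minimal sketch. This is not the project's implementation: it assumes character-level voting, uses the first hypothesis as the alignment pivot, aligns with Python's difflib, and ignores characters that other hypotheses insert. The function names align_to_pivot and vote are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher


def align_to_pivot(pivot: str, hyp: str) -> list[str]:
    """For each pivot character, return the character that hyp proposes.

    An empty string means hyp has no character at that position (a deletion).
    """
    votes = [""] * len(pivot)
    matcher = SequenceMatcher(a=pivot, b=hyp, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("equal", "replace"):
            # Pair characters positionally; unpaired pivot slots stay deletions.
            for k in range(min(i2 - i1, j2 - j1)):
                votes[i1 + k] = hyp[j1 + k]
        # "delete": pivot slots keep the "" vote.
        # "insert": extra hyp characters are ignored (a simplification).
    return votes


def vote(hypotheses: list[str]) -> str:
    """Majority-vote each pivot position; ties are resolved in favor of the pivot."""
    if not hypotheses:
        return ""
    pivot = hypotheses[0]
    columns = [list(pivot)] + [align_to_pivot(pivot, h) for h in hypotheses[1:]]
    merged = []
    for pos, pivot_char in enumerate(pivot):
        counts = Counter(column[pos] for column in columns)
        best, best_count = counts.most_common(1)[0]
        if counts[pivot_char] == best_count:
            best = pivot_char
        merged.append(best)
    return "".join(merged)


# Three hypothetical ASR outputs for the same segment, each from a different model.
print(vote(["我今曰好開心", "我今日好開心", "我今日好開深"]))  # 我今日好開心
```

Each model makes a different single-character error here, and the majority recovers the intended sentence. A full ROVER-style system would also handle insertions and could weight votes by model confidence.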
Quick Start & Requirements
To set up, clone the repository, create a Conda environment with Python 3.10 and pynini==2.1.5, and install the remaining dependencies with pip install -r requirements.txt. ASR inference is run from the command line and requires a model checkpoint and its configuration. TTS inference involves downloading models from Hugging Face (ASLP-lab/WSYue-TTS) and running the provided Python scripts, which depend on packages such as funasr and torchaudio. GPU acceleration is recommended for inference.
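As a sketch of the model download step, the snippet below uses snapshot_download from the huggingface_hub package. It assumes that ASLP-lab/WSYue-TTS is a model repository (the default repository type) and that huggingface_hub is installed. The helper name download_tts_models and the default target directory are hypothetical, and the repository's own scripts may fetch the models differently.

```python
from pathlib import Path

REPO_ID = "ASLP-lab/WSYue-TTS"  # repository name as given in the text above


def download_tts_models(local_dir: str = "WSYue-TTS") -> Path:
    """Download a snapshot of the repository and return the local directory."""
    # Imported here so the module still loads when huggingface_hub is missing.
    from huggingface_hub import snapshot_download

    # Assumption: the repository is a model repo, so repo_type is left at its default.
    path = snapshot_download(repo_id=REPO_ID, local_dir=local_dir)
    return Path(path)
```

Calling download_tts_models() requires network access and may transfer several gigabytes, so check available disk space first.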
Maintenance & Community
The project is associated with the ASLP@NPU group. Contact emails (lhli@mail.nwpu.edu.cn, gzhao@mail.nwpu.edu.cn) are provided for inquiries. A WeChat discussion group is available via a QR code scan. No specific details on active contributors, sponsorships, or a public roadmap are present in the provided text.
Licensing & Compatibility
The README does not state the license under which the dataset or code is released. This omission is a significant adoption blocker, especially for commercial use or integration into closed-source projects.
Limitations & Caveats
The README does not list known limitations, bugs, or unsupported platforms. Although the dataset is described as large-scale and multi-dimensionally annotated, the README does not discuss possible gaps in coverage or under-represented linguistic phenomena. The missing license information remains a critical caveat for potential users.