xzf-thu/Mega-ASR: Robust ASR for real-world acoustic challenges
Top 51.4% on SourcePulse
Summary
Mega-ASR addresses the critical need for robust Automatic Speech Recognition (ASR) in real-world, "in-the-wild" acoustic conditions where standard models falter. It targets engineers, researchers, and power users requiring reliable speech transcription across diverse and challenging environments, offering significant accuracy gains and reduced failure modes compared to state-of-the-art alternatives.
How It Works
This project introduces a foundation ASR model trained systematically on 7 atomic and 54 compound acoustic scenarios that simulate real-world noise, far-field speech, obstructions, and transmission artifacts. Training follows an Acoustic-to-Semantic Progressive Supervised Fine-Tuning (A2S-SFT) strategy: the encoder and aligner are curriculum-trained on progressively harder data, then the LLM is fine-tuned, and finally the whole system is optimized jointly end to end. Further refinement uses reinforcement learning with WER-gated policy learning (DG-WGPO), which prioritizes token-level acoustic refinement for low-WER samples and sentence-level semantic reconstruction for high-WER samples to combat hallucinations and omissions.
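The WER gate described above can be illustrated with a minimal sketch. This is not the project's DG-WGPO code, which has not been released; the threshold value `WER_GATE`, the function names, and the objective labels are all hypothetical and chosen only for illustration. The sketch computes word error rate with a standard word-level edit distance and routes each sample to one of two refinement objectives.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the first i-1 reference
    # words and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical gate value; the source does not state the real threshold.
WER_GATE = 0.3


def choose_objective(reference: str, hypothesis: str, gate: float = WER_GATE) -> str:
    """Route a sample by its WER (labels are illustrative, not the project's)."""
    wer = word_error_rate(reference, hypothesis)
    if wer <= gate:
        return "token_level_acoustic_refinement"
    return "sentence_level_semantic_reconstruction"


if __name__ == "__main__":
    ref = "the cat sat on the mat"
    print(choose_objective(ref, "the cat sat on a mat"))
    print(choose_objective(ref, "a dog"))
```

A near-correct hypothesis falls below the gate and would receive fine-grained, token-level correction, while a badly wrong one (for example, a hallucinated or truncated transcript) is treated as a whole sentence to be reconstructed.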
Quick Start & Requirements
To install, clone the repository, create and activate a conda environment with Python 3.10, then install the dependencies from requirements.txt:
git clone https://github.com/xzf-thu/Mega-ASR.git
cd Mega-ASR
conda create -n mega-asr python=3.10 -y
conda activate mega-asr
pip install -r requirements.txt
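Once the commands above have finished, a short script can confirm the environment looks as expected. The sketch below is an assumption-laden convenience check, not part of the repository: it assumes that requirements.txt installs PyTorch (importable as `torch`) and `wandb`, which the source does not confirm, and the function name `check_environment` is hypothetical.

```python
import importlib.util
import sys


def check_environment() -> dict:
    """Report the Python version and whether the assumed packages are present."""
    report = {
        "python_version": f"{sys.version_info.major}.{sys.version_info.minor}",
        "python_is_3_10": sys.version_info[:2] == (3, 10),
        # Assumption: the project depends on PyTorch and wandb.
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "wandb_installed": importlib.util.find_spec("wandb") is not None,
        "cuda_available": False,
    }
    if report["torch_installed"]:
        import torch  # imported lazily so the check also runs without PyTorch

        report["cuda_available"] = bool(torch.cuda.is_available())
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

Run it inside the activated mega-asr environment. If `cuda_available` is reported as False, inference may still run on CPU, but likely far more slowly.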
Prerequisites: conda, pip, and git. GPU acceleration is typically required for ASR tasks. wandb is used for experiment tracking.
Highlighted Details
Maintenance & Community
The project has seen recent releases in May 2026, including model weights, inference/training codebase, technical report, and a new benchmark (Voices-in-the-Wild-Bench). Upcoming releases are planned for RL code, WebUI optimization, and data processing pipelines. No explicit community channels (e.g., Discord, Slack) are listed.
Licensing & Compatibility
Mega-ASR is released under the Apache-2.0 License, permitting broad usage, including commercial applications.
Limitations & Caveats
The model exhibits a slight degradation in basic recognition capability due to its training on inherently high-WER data, though this is mitigated by a routing mechanism. The DG-WGPO reinforcement learning module is slated for future release. Specific hardware requirements (e.g., CUDA version, GPU memory) are not detailed.