Mega-ASR by xzf-thu

Robust ASR for real-world acoustic challenges

Created 1 month ago

1,063 stars

Top 34.8% on SourcePulse

Summary

Mega-ASR targets robust Automatic Speech Recognition (ASR) in real-world, "in-the-wild" acoustic conditions where standard models falter. It is aimed at engineers, researchers, and power users who need reliable transcription across diverse and challenging environments, and it reports higher accuracy and fewer failure cases than state-of-the-art alternatives.

How It Works

This project introduces a foundation ASR model trained systematically on 7 atomic and 54 compound acoustic scenarios that simulate real-world noise, far-field speech, obstructions, and transmission artifacts. Training follows an Acoustic-to-Semantic Progressive Supervised Fine-Tuning (A2S-SFT) strategy: the encoder and aligner are curriculum-trained on progressively harder data, then the LLM is fine-tuned, and finally the whole system is jointly optimized end to end. A reinforcement learning stage, WER-gated policy learning (DG-WGPO), refines the model further. It applies token-level acoustic refinement to low-WER samples and sentence-level semantic reconstruction to high-WER samples in order to combat hallucinations and omissions.
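The WER gate described above can be illustrated with a minimal Python sketch. It computes word error rate with a standard word-level edit distance and then picks one of the two refinement modes. The threshold value, the function names, and the returned labels are assumptions made for illustration; the source does not state where the gate sits or how DG-WGPO is implemented.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref: List[str] = reference.split()
    hyp: List[str] = hypothesis.split()
    if not ref:
        # Convention chosen for this sketch: an empty reference counts as
        # an error only when the hypothesis is non-empty.
        return 0.0 if not hyp else 1.0
    # At the start of row i, prev[j] is the edit distance between the
    # first i-1 reference words and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical threshold: the source does not give the actual gate value.
WER_GATE = 0.5


def select_objective(reference: str, hypothesis: str,
                     gate: float = WER_GATE) -> str:
    """Pick the refinement mode for one sample based on its WER."""
    wer = word_error_rate(reference, hypothesis)
    if wer <= gate:
        return "token_level_acoustic"      # low WER: fix small acoustic errors
    return "sentence_level_semantic"       # high WER: rebuild the sentence
```

Under these assumptions, a near-correct transcript such as "turn on the light" for the reference "turn on the lights" falls on the token-level side, while an empty output falls on the sentence-level side.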

Quick Start & Requirements

Installation involves cloning the repository, creating a conda environment with Python 3.10, activating it, and installing dependencies via pip install -r requirements.txt.

Primary install commands:

git clone https://github.com/xzf-thu/Mega-ASR.git
cd Mega-ASR
conda create -n mega-asr python=3.10 -y
conda activate mega-asr
pip install -r requirements.txt

Non-default prerequisites: conda, pip, git. GPU acceleration is typically required for ASR tasks. wandb is used for experiment tracking.

Links:
- Repository: https://github.com/xzf-thu/Mega-ASR.git
- Technical Report: https://arxiv.org/abs/2605.19833
- Model weights and datasets are available via Hugging Face and linked resources.
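
Before running the install commands, it may help to confirm that the prerequisites named above (git, conda, pip) are reachable. The following is a small sketch using only the Python standard library; the helper name `find_tools` is hypothetical and is not part of the Mega-ASR repository.

```python
import shutil
from typing import Dict, Iterable


def find_tools(tools: Iterable[str] = ("git", "conda", "pip")) -> Dict[str, bool]:
    """Map each tool name to whether an executable with that name is on PATH."""
    return {tool: shutil.which(tool) is not None for tool in tools}


if __name__ == "__main__":
    for tool, found in find_tools().items():
        print(f"{tool}: {'found' if found else 'missing'}")
```

This only checks for executables on PATH. A conda installation that is exposed solely through a shell function may be reported as missing, and the check says nothing about GPU drivers or CUDA, which the project does not specify.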

Highlighted Details

- Achieves peak gains of nearly 30% over leading open-source and closed-source SOTA models in challenging acoustic environments.
- Designed as a single model covering 7 atomic and 54 compound acoustic scenarios for comprehensive real-world robustness.
- Substantially reduces common ASR failure modes, including hallucinations, empty outputs, and dropped utterances.
- Features a router mechanism that dynamically activates LoRA weights based on audio input characteristics.
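
The router idea in the last item can be sketched as a gate around a standard LoRA update, where the base layer output W x is extended by a low-rank term scale * B (A x) only when the router enables the adapter. Everything specific here is an assumption for illustration: the source does not say which audio characteristics the router inspects, so `noise_score` and its threshold are hypothetical, and the tiny matrices stand in for real model weights.

```python
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Sequence[float]) -> Vector:
    """Multiply matrix m by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def route(noise_score: float, threshold: float = 0.5) -> bool:
    """Hypothetical router: enable the adapter only for audio judged degraded."""
    return noise_score > threshold


def lora_forward(x: Sequence[float], W: Matrix, A: Matrix, B: Matrix,
                 scale: float, adapter_on: bool) -> Vector:
    """Return W x, plus the low-rank update scale * B (A x) if the adapter is on."""
    y = matvec(W, x)
    if adapter_on:
        delta = matvec(B, matvec(A, x))
        y = [yi + scale * di for yi, di in zip(y, delta)]
    return y


# Toy weights: a 2x2 base layer and a rank-1 adapter (A is 1x2, B is 2x1).
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]
B = [[1.0], [2.0]]
x = [1.0, 2.0]

clean = lora_forward(x, W, A, B, scale=0.5, adapter_on=route(0.1))
noisy = lora_forward(x, W, A, B, scale=0.5, adapter_on=route(0.9))
print(clean)  # [1.0, 2.0]  (base path only)
print(noisy)  # [2.5, 5.0]  (base path plus scaled low-rank update)
```

Gating the adapter this way leaves the base path untouched for inputs the router treats as clean, which is one plausible reading of how routing could limit the loss in basic recognition capability.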

Maintenance & Community

The project's May 2026 releases include model weights, the inference/training codebase, a technical report, and a new benchmark (Voices-in-the-Wild-Bench). Upcoming releases are planned for RL code, WebUI optimization, and data processing pipelines. No explicit community channels (e.g., Discord, Slack) are listed.

Licensing & Compatibility

Mega-ASR is released under the Apache-2.0 License, permitting broad usage, including commercial applications.

Limitations & Caveats

The model exhibits a slight degradation in basic recognition capability due to its training on inherently high-WER data, though this is mitigated by a routing mechanism. The DG-WGPO reinforcement learning module is slated for future release. Specific hardware requirements (e.g., CUDA version, GPU memory) are not detailed.

Health Check

Last Commit: 1 month ago

Responsiveness: Inactive

Star History

91 stars in the last 30 days