Mega-ASR  by xzf-thu

Robust ASR for real-world acoustic challenges

Created 1 week ago

642 stars

Top 51.4% on SourcePulse

Summary

Mega-ASR targets robust Automatic Speech Recognition (ASR) in real-world, "in-the-wild" acoustic conditions where standard models falter. It is aimed at engineers, researchers, and power users who need reliable transcription across diverse and challenging environments, and it reports significant accuracy gains and fewer failures than state-of-the-art alternatives.

How It Works

This project introduces a foundation ASR model trained systematically on 7 atomic and 54 compound acoustic scenarios that simulate real-world noise, far-field speech, obstructions, and transmission artifacts. Training follows an Acoustic-to-Semantic Progressive Supervised Fine-Tuning (A2S-SFT) strategy: the encoder and aligner are curriculum-trained on progressively harder data, followed by LLM fine-tuning and end-to-end joint optimization. A further reinforcement learning stage applies WER-gated policy learning (DG-WGPO), which prioritizes token-level acoustic refinement for low-WER samples and sentence-level semantic reconstruction for high-WER samples, to combat hallucinations and omissions.
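The WER gate described above can be illustrated with a minimal sketch. This is not the project's DG-WGPO code, which has not been released; the function names, the 0.3 threshold, and the two objective labels are assumptions chosen for illustration. Only the idea of switching the training signal on a sample's word error rate comes from the description.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical gate value; the source does not state the threshold it uses.
WER_GATE = 0.3


def select_objective(reference, hypothesis, gate=WER_GATE):
    """Pick which refinement signal a sample would receive under the gate."""
    if wer(reference, hypothesis) <= gate:
        return "token_level"      # near-correct output: refine individual tokens
    return "sentence_level"       # badly wrong output: reconstruct the sentence
```

For example, a hypothesis with one wrong word out of four (WER 0.25) would fall on the token-level side of this assumed gate, while an empty output (WER 1.0) would fall on the sentence-level side.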

Quick Start & Requirements

To install, clone the repository, create and activate a conda environment with Python 3.10, and install the dependencies with pip install -r requirements.txt.

  • Primary install/run command:
    git clone https://github.com/xzf-thu/Mega-ASR.git
    cd Mega-ASR
    conda create -n mega-asr python=3.10 -y
    conda activate mega-asr
    pip install -r requirements.txt
    
  • Non-default prerequisites: conda, pip, git. A GPU is likely needed in practice, although the project does not detail hardware requirements. Experiment tracking uses wandb.
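Before running the commands above, the listed prerequisites can be checked with a short script. This is a sketch and not part of the repository; the helper names are hypothetical, and it uses only the Python standard library.

```python
import shutil
import sys

REQUIRED_TOOLS = ("git", "conda", "pip")  # prerequisites named in the list above
EXPECTED_PYTHON = (3, 10)                 # version used in the conda command


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


def python_matches(expected=EXPECTED_PYTHON, version=None):
    """Check whether the running interpreter has the expected major.minor."""
    version = sys.version_info if version is None else version
    return tuple(version[:2]) == expected


if __name__ == "__main__":
    missing = missing_tools()
    print("Missing tools:", ", ".join(missing) if missing else "none")
    print("Python 3.10 active:", python_matches())
```

Run it inside the activated mega-asr environment so that the Python version check reflects the environment rather than the system interpreter.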

Highlighted Details

  • Reports gains of up to nearly 30% over leading open- and closed-source SOTA models in challenging acoustic environments.
  • Designed as a single model covering 7 atomic and 54 compound acoustic scenarios for comprehensive real-world robustness.
  • Substantially reduces common ASR failure modes, including hallucinations, empty outputs, and dropped utterances.
  • Features a router mechanism to dynamically activate LoRA weights based on audio input characteristics.
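The routing idea in the last bullet can be sketched as follows. This is an illustration under stated assumptions, not the project's router: the signal-to-noise feature, the 15 dB threshold, and all function names are hypothetical. It shows only the general pattern of adding a low-rank (LoRA) update to a base layer when the input is judged to need it.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector, in pure Python."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def route(snr_db, threshold_db=15.0):
    """Hypothetical router: enable the adapter only for noisy input.

    The source says routing depends on audio input characteristics; using
    an estimated SNR with this threshold is an assumption for illustration.
    """
    return snr_db < threshold_db


def forward(x, W, A, B, use_lora, scale=1.0):
    """Base projection W @ x, plus the low-rank update B @ (A @ x) if enabled."""
    y = matvec(W, x)
    if use_lora:
        delta = matvec(B, matvec(A, x))
        y = [yi + scale * di for yi, di in zip(y, delta)]
    return y


if __name__ == "__main__":
    W = [[1.0, 0.0], [0.0, 1.0]]   # base weights (identity, for readability)
    A = [[1.0, 1.0]]               # rank-1 down-projection
    B = [[1.0], [2.0]]             # rank-1 up-projection
    x = [1.0, 2.0]
    for snr in (5.0, 30.0):
        print(snr, forward(x, W, A, B, use_lora=route(snr)))
```

Under this sketch, clean audio passes through the base weights unchanged, which is one way a router could limit the loss of basic recognition capability on easy input.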

Maintenance & Community

Releases in May 2026 include model weights, the inference and training codebase, a technical report, and a new benchmark (Voices-in-the-Wild-Bench). Planned releases cover the RL code, WebUI optimization, and data processing pipelines. No explicit community channels (e.g., Discord, Slack) are listed.

Licensing & Compatibility

Mega-ASR is released under the Apache-2.0 License, permitting broad usage, including commercial applications.

Limitations & Caveats

The model exhibits a slight degradation in basic recognition capability due to its training on inherently high-WER data, though this is mitigated by a routing mechanism. The DG-WGPO reinforcement learning module is slated for future release. Specific hardware requirements (e.g., CUDA version, GPU memory) are not detailed.

Health Check

  • Last commit: 2 days ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 16
  • Star history: 644 stars in the last 11 days