Generative ASR error correction via cross-modal fusion
This project provides a framework for generative Automatic Speech Recognition (ASR) error correction by fusing the Whisper audio encoder with the LLaMA language model decoder. It targets researchers and practitioners in ASR and NLP, offering improved accuracy over traditional methods by leveraging cross-modal integration.
How It Works
The core approach uses the Whisper encoder to extract acoustic features from the audio input. These features are fused into the LLaMA decoder, which is prompted with n-best hypotheses from an upstream ASR system. This cross-modal fusion lets the LLM generate a corrected transcription, with a claimed relative Word Error Rate reduction (WERR) of 28.83% to 37.66%. The system is designed for parameter efficiency, with only 7.97M trainable parameters.
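The repository's actual fusion layers may be wired differently; the following is a minimal sketch, assuming a hypothetical low-rank adapter (AcousticFusionAdapter, with assumed dimensions of 1280 for the Whisper encoder and 4096 for the LLaMA hidden size) that projects pooled acoustic features into the decoder's hidden space and adds them through a learned gate, so only the adapter's small parameter set is trained.

```python
# Hypothetical sketch of the cross-modal fusion idea: pooled Whisper encoder
# features are projected through a low-rank adapter and added to the LLaMA
# decoder's hidden states; only the adapter parameters would be trainable.
import torch
import torch.nn as nn


class AcousticFusionAdapter(nn.Module):
    """Lightweight adapter that injects audio features into decoder states."""

    def __init__(self, audio_dim: int = 1280, text_dim: int = 4096, rank: int = 8):
        super().__init__()
        # Low-rank down/up projection keeps the trainable parameter count small.
        self.down = nn.Linear(audio_dim, rank, bias=False)
        self.up = nn.Linear(rank, text_dim, bias=False)
        self.gate = nn.Parameter(torch.zeros(1))  # tanh(0) = 0, so fusion starts as identity

    def forward(self, decoder_hidden: torch.Tensor, audio_features: torch.Tensor) -> torch.Tensor:
        # Pool the encoder output over time and add it as a gated bias to every
        # decoder position (a simplification of cross-attention-style fusion).
        pooled = audio_features.mean(dim=1)                 # (batch, audio_dim)
        injected = self.up(self.down(pooled)).unsqueeze(1)  # (batch, 1, text_dim)
        return decoder_hidden + torch.tanh(self.gate) * injected


# Toy usage with random tensors standing in for real encoder/decoder outputs.
adapter = AcousticFusionAdapter()
decoder_hidden = torch.randn(2, 16, 4096)    # (batch, tokens, LLaMA hidden size)
audio_features = torch.randn(2, 1500, 1280)  # (batch, frames, Whisper encoder size)
fused = adapter(decoder_hidden, audio_features)
print(fused.shape)  # torch.Size([2, 16, 4096])
```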
Quick Start & Requirements
Set up the environment with conda env create -f environment.yml or pip install -r requirements.txt.
Run the demo via demo.py.
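Conceptually, correction operates on an n-best list from an upstream ASR decoder. As a hypothetical illustration only (the repository's actual prompt template and demo interface may differ), the snippet below folds candidate transcriptions into a single instruction-style prompt for the fused decoder.

```python
# Hypothetical prompt construction for n-best ASR error correction; the exact
# template used by the repository's demo and training scripts may differ.
def build_correction_prompt(nbest_hypotheses: list[str]) -> str:
    """Format an n-best list into a single instruction-style prompt."""
    numbered = "\n".join(f"{i + 1}. {hyp}" for i, hyp in enumerate(nbest_hypotheses))
    return (
        "The following are candidate transcriptions of the same utterance:\n"
        f"{numbered}\n"
        "Write the most likely correct transcription."
    )


# Example n-best list as it might come from a beam-search ASR decoder.
print(build_correction_prompt([
    "i scream for ice cream",
    "eye scream for ice cream",
    "i scream four ice cream",
]))
```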
Launch training with python training/WL-S.py, passing arguments for the learning rate, GPU count, and data paths.
Highlighted Details
Maintenance & Community
The project builds on lit-llama, stanford_alpaca, and Whisper.
Licensing & Compatibility
Limitations & Caveats
The README notes that the accompanying paper is "[YET]" to be published, so details may still change. Hardware requirements beyond the use of GPUs for training are not specified.