FireRedASR by FireRedTeam

Open-source ASR models for Mandarin, dialects, and English

Created 11 months ago
1,713 stars

Top 24.6% on SourcePulse

View on GitHub
Project Summary

FireRedASR provides open-source, industrial-grade Automatic Speech Recognition (ASR) models for Mandarin, Chinese dialects, and English. It offers two variants: FireRedASR-LLM, which leverages large language models for state-of-the-art accuracy, and FireRedASR-AED, which balances accuracy and efficiency. The project targets researchers and developers who need high-accuracy speech-to-text, particularly for Mandarin, with demonstrated SOTA results on public benchmarks.

How It Works

FireRedASR ships two architectures. FireRedASR-LLM uses an Encoder-Adapter-LLM framework: an acoustic encoder extracts speech features, a lightweight adapter projects them into the embedding space of a large language model, and the LLM decodes the transcription. FireRedASR-AED employs a conventional Attention-based Encoder-Decoder (AED) architecture, optimized for efficiency, and can also serve as a robust speech representation module. This dual approach lets users pick a model based on accuracy or efficiency requirements.
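The Encoder-Adapter-LLM data flow can be sketched with a toy example. Everything here is an illustrative stand-in, not FireRedASR's actual code: the dimensions are shrunk (a real setup would use, e.g., high-dimensional Conformer features and a 7B-parameter LLM such as Qwen2-7B-Instruct), and the random linear layers stand in for trained modules.

```python
import random

random.seed(0)

def matvec(W, x):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

class Encoder:
    """Stand-in for the acoustic encoder: audio frames -> speech features."""
    def __init__(self, in_dim, feat_dim):
        self.W = [[random.gauss(0, 0.1) for _ in range(in_dim)]
                  for _ in range(feat_dim)]
    def __call__(self, frames):
        # Subsample time by 4x, as speech encoders typically do.
        return [matvec(self.W, f) for f in frames[::4]]

class Adapter:
    """Projects encoder features into the LLM's embedding dimension."""
    def __init__(self, feat_dim, llm_dim):
        self.W = [[random.gauss(0, 0.1) for _ in range(feat_dim)]
                  for _ in range(llm_dim)]
    def __call__(self, feats):
        return [matvec(self.W, f) for f in feats]

# 40 frames of 16-dim filterbank-like features (random stand-ins).
frames = [[random.gauss(0, 1) for _ in range(16)] for _ in range(40)]
encoder = Encoder(in_dim=16, feat_dim=32)
adapter = Adapter(feat_dim=32, llm_dim=64)

speech_embeds = adapter(encoder(frames))
# The LLM would consume these embeddings alongside its text prompt and
# decode the transcription autoregressively.
print(len(speech_embeds), len(speech_embeds[0]))  # 10 64
```

The key design point is that only the adapter needs to bridge the modality gap: the encoder keeps producing speech features and the LLM keeps consuming token-like embeddings.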

Quick Start & Requirements

  • Install: Clone the repository and install dependencies via pip install -r requirements.txt within a Python 3.10 Conda environment.
  • Prerequisites: Requires model weights downloaded from Hugging Face. FireRedASR-LLM-L also requires Qwen2-7B-Instruct weights. Audio must be converted to 16kHz, 16-bit PCM WAV format using ffmpeg.
  • Setup: Basic setup involves cloning, environment creation, and dependency installation.
  • Usage: Examples and command-line scripts (speech2text.py) are provided for inference. Python API usage is also demonstrated.
  • Links: Paper, Model, Blog
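Because inference expects 16kHz, 16-bit PCM WAV input, it can save debugging time to validate files up front. A minimal stdlib sketch (the check_wav helper is ours, not part of the repo; the mono assumption follows the usual ffmpeg conversion with -ar 16000 -ac 1 -acodec pcm_s16le):

```python
import wave

def check_wav(path):
    """Return True if `path` looks like a 16 kHz, 16-bit PCM mono WAV,
    i.e. the format FireRedASR expects after ffmpeg conversion."""
    with wave.open(path, "rb") as w:
        return (w.getframerate() == 16000
                and w.getsampwidth() == 2   # 16-bit = 2 bytes per sample
                and w.getnchannels() == 1)  # mono (assumption, see lead-in)

# Example: write one second of 16 kHz silence and verify it.
with wave.open("test.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(16000)
    w.writeframes(b"\x00\x00" * 16000)

print(check_wav("test.wav"))  # True
```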

Highlighted Details

  • Achieves SOTA on public Mandarin ASR benchmarks (e.g., 0.55% CER on aishell1 with FireRedASR-AED-L).
  • Demonstrates strong performance on Chinese dialects (KeSpeech) and English (LibriSpeech).
  • Offers specialized capability for singing lyrics recognition.
  • Models range from 1.1B to 8.3B parameters, with LLM variants requiring significant computational resources.

Maintenance & Community

The project is actively developed, with recent releases in early 2025. Key contributors are listed as Kai-Tuo Xu, Feng-Long Xie, Xu Tang, and Yao Hu. No community engagement channels (e.g., a Discord or mailing list) are mentioned in the README.

Licensing & Compatibility

The project is released under an unspecified license. The README mentions dependencies on other open-source works like Qwen2-7B-Instruct, WeNet, and Speech-Transformer, which may have their own licensing terms that could affect commercial use or closed-source linking.

Limitations & Caveats

FireRedASR-AED supports audio inputs up to 60s; longer inputs may cause issues. FireRedASR-LLM supports inputs up to 30s, and its behavior on longer inputs is unknown. Batch beam search with FireRedASR-LLM may produce repetition unless the utterances in a batch have similar lengths.
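Given these length limits, one workaround is to split long recordings before inference. A minimal stdlib sketch (the split_wav helper and fixed-boundary strategy are ours, not part of the repo; a production pipeline would typically cut at silences using a VAD instead):

```python
import wave

MAX_SECONDS = 60  # FireRedASR-AED limit; use 30 for FireRedASR-LLM

def split_wav(path, max_seconds=MAX_SECONDS):
    """Split a PCM WAV into chunks no longer than max_seconds.
    Returns the list of chunk file paths."""
    with wave.open(path, "rb") as w:
        params = w.getparams()
        chunk_frames = max_seconds * w.getframerate()
        outputs = []
        i = 0
        while True:
            frames = w.readframes(chunk_frames)
            if not frames:
                break
            out_path = f"{path}.part{i}.wav"
            with wave.open(out_path, "wb") as out:
                out.setparams(params)  # wave fixes nframes on close
                out.writeframes(frames)
            outputs.append(out_path)
            i += 1
    return outputs

# Example: 90 s of 16 kHz silence splits into two <=60 s chunks.
with wave.open("long.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(16000)
    w.writeframes(b"\x00\x00" * (16000 * 90))

parts = split_wav("long.wav")
print(len(parts))  # 2
```

Note that cutting at arbitrary sample boundaries can bisect a word, so transcripts of adjacent chunks may need light stitching.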

Health Check

  • Last commit: 3 weeks ago
  • Responsiveness: Inactive
  • Pull requests (30d): 2
  • Issues (30d): 3
  • Star history: 52 stars in the last 30 days
