FireRedASR by FireRedTeam

Open-source ASR models for Mandarin, dialects, and English

Created 11 months ago
1,713 stars

Top 24.6% on SourcePulse

View on GitHub
Project Summary

FireRedASR provides open-source, industrial-grade Automatic Speech Recognition (ASR) models for Mandarin, Chinese dialects, and English. It offers two variants: FireRedASR-LLM, which leverages large language models for state-of-the-art accuracy, and FireRedASR-AED, which balances accuracy and efficiency. The project targets researchers and developers who need high-accuracy speech-to-text, particularly for Mandarin, with demonstrated SOTA results on public benchmarks.

How It Works

FireRedASR ships two architectures. FireRedASR-LLM uses an Encoder-Adapter-LLM framework: an acoustic encoder extracts speech features, a lightweight adapter projects them into the embedding space of a large language model, and the LLM decodes the transcription. FireRedASR-AED employs a conventional Attention-based Encoder-Decoder (AED) architecture, optimized for efficiency, and can also serve as a robust speech representation module. This dual approach lets users pick a model based on accuracy or efficiency requirements.
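The Encoder-Adapter-LLM data flow can be sketched with a toy example. Everything here is an illustrative stand-in, not FireRedASR's actual code: the dimensions are shrunk (a real setup would use, e.g., high-dimensional Conformer features and a 7B-parameter LLM such as Qwen2-7B-Instruct), and the random linear layers stand in for trained modules.

```python
import random

random.seed(0)

def matvec(W, x):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

class Encoder:
    """Stand-in for the acoustic encoder: audio frames -> speech features."""
    def __init__(self, in_dim, feat_dim):
        self.W = [[random.gauss(0, 0.1) for _ in range(in_dim)]
                  for _ in range(feat_dim)]
    def __call__(self, frames):
        # Subsample time by 4x, as speech encoders typically do.
        return [matvec(self.W, f) for f in frames[::4]]

class Adapter:
    """Projects encoder features into the LLM's embedding dimension."""
    def __init__(self, feat_dim, llm_dim):
        self.W = [[random.gauss(0, 0.1) for _ in range(feat_dim)]
                  for _ in range(llm_dim)]
    def __call__(self, feats):
        return [matvec(self.W, f) for f in feats]

# 40 frames of 16-dim filterbank-like features (random stand-ins).
frames = [[random.gauss(0, 1) for _ in range(16)] for _ in range(40)]
encoder = Encoder(in_dim=16, feat_dim=32)
adapter = Adapter(feat_dim=32, llm_dim=64)

speech_embeds = adapter(encoder(frames))
# The LLM would consume these embeddings alongside its text prompt and
# decode the transcription autoregressively.
print(len(speech_embeds), len(speech_embeds[0]))  # 10 64
```

The key design point is that only the adapter needs to bridge the modality gap: the encoder keeps producing speech features and the LLM keeps consuming token-like embeddings.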

Quick Start & Requirements

  • Install: Clone the repository and install dependencies via pip install -r requirements.txt within a Python 3.10 Conda environment.
  • Prerequisites: Requires model weights downloaded from Hugging Face. FireRedASR-LLM-L also requires Qwen2-7B-Instruct weights. Audio must be converted to 16kHz, 16-bit PCM WAV format using ffmpeg.
  • Setup: Basic setup involves cloning, environment creation, and dependency installation.
  • Usage: Examples and command-line scripts (speech2text.py) are provided for inference. Python API usage is also demonstrated.
  • Links: Paper, Model, Blog
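Because inference expects 16kHz, 16-bit PCM WAV input, it can save debugging time to validate files up front. A minimal stdlib sketch (the check_wav helper is ours, not part of the repo; the mono assumption follows the usual ffmpeg conversion with -ar 16000 -ac 1 -acodec pcm_s16le):

```python
import wave

def check_wav(path):
    """Return True if `path` looks like a 16 kHz, 16-bit PCM mono WAV,
    i.e. the format FireRedASR expects after ffmpeg conversion."""
    with wave.open(path, "rb") as w:
        return (w.getframerate() == 16000
                and w.getsampwidth() == 2   # 16-bit = 2 bytes per sample
                and w.getnchannels() == 1)  # mono (assumption, see lead-in)

# Example: write one second of 16 kHz silence and verify it.
with wave.open("test.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(16000)
    w.writeframes(b"\x00\x00" * 16000)

print(check_wav("test.wav"))  # True
```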

Highlighted Details

  • Achieves SOTA on public Mandarin ASR benchmarks (e.g., 0.55% CER on aishell1 with FireRedASR-AED-L).
  • Demonstrates strong performance on Chinese dialects (KeSpeech) and English (LibriSpeech).
  • Offers specialized capability for singing lyrics recognition.
  • Models range from 1.1B to 8.3B parameters, with LLM variants requiring significant computational resources.

Maintenance & Community

The project is actively developed, with recent releases in early 2025. Key contributors are listed as Kai-Tuo Xu, Feng-Long Xie, Xu Tang, and Yao Hu. No community engagement channels (e.g., a Discord or mailing list) are mentioned in the README.

Licensing & Compatibility

The project is released under an unspecified license. The README mentions dependencies on other open-source works like Qwen2-7B-Instruct, WeNet, and Speech-Transformer, which may have their own licensing terms that could affect commercial use or closed-source linking.

Limitations & Caveats

FireRedASR-AED supports audio inputs up to 60s; longer inputs may cause issues. FireRedASR-LLM supports inputs up to 30s, and its behavior on longer inputs is unknown. Batch beam search with FireRedASR-LLM may produce repetition unless the utterances in a batch have similar lengths.
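Given these length limits, one workaround is to split long recordings before inference. A minimal stdlib sketch (the split_wav helper and fixed-boundary strategy are ours, not part of the repo; a production pipeline would typically cut at silences using a VAD instead):

```python
import wave

MAX_SECONDS = 60  # FireRedASR-AED limit; use 30 for FireRedASR-LLM

def split_wav(path, max_seconds=MAX_SECONDS):
    """Split a PCM WAV into chunks no longer than max_seconds.
    Returns the list of chunk file paths."""
    with wave.open(path, "rb") as w:
        params = w.getparams()
        chunk_frames = max_seconds * w.getframerate()
        outputs = []
        i = 0
        while True:
            frames = w.readframes(chunk_frames)
            if not frames:
                break
            out_path = f"{path}.part{i}.wav"
            with wave.open(out_path, "wb") as out:
                out.setparams(params)  # wave fixes nframes on close
                out.writeframes(frames)
            outputs.append(out_path)
            i += 1
    return outputs

# Example: 90 s of 16 kHz silence splits into two <=60 s chunks.
with wave.open("long.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(16000)
    w.writeframes(b"\x00\x00" * (16000 * 90))

parts = split_wav("long.wav")
print(len(parts))  # 2
```

Note that cutting at arbitrary sample boundaries can bisect a word, so transcripts of adjacent chunks may need light stitching.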

Health Check

  • Last commit: 3 weeks ago
  • Responsiveness: Inactive
  • Pull requests (30d): 2
  • Issues (30d): 3
  • Star history: 52 stars in the last 30 days
