Fun-ASR by FunAudioLLM

Advanced speech recognition toolkit for global audio

Created 2 months ago
891 stars

Top 40.4% on SourcePulse

Project Summary

Fun-ASR is an end-to-end large speech recognition model from Tongyi Lab, designed for high-precision, multi-language transcription. It targets developers and researchers needing robust ASR capabilities, especially in challenging environments or for specialized domains, offering benefits like low-latency real-time transcription and extensive dialect/accent support.

How It Works

The system employs an end-to-end architecture trained on tens of millions of hours of real speech data. It is optimized for far-field, high-noise scenarios, where it achieves up to 93% accuracy. Novel aspects include deep support for 7 Chinese dialects and 26 regional accents, recognition of 31 languages including mixed-language speech, and improved lyric transcription over background music.

Quick Start & Requirements

To install, clone the repository (https://github.com/FunAudioLLM/Fun-ASR.git), change into its directory, and run pip install -r requirements.txt. GPU acceleration (e.g., cuda:0) is recommended for inference. Online demos are linked from ModelScope and Hugging Face Spaces.
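
Since a GPU is recommended but may not always be present, a script can choose the device string before loading the model. The following is a minimal sketch rather than anything taken from the Fun-ASR codebase: it assumes PyTorch is the backend that provides CUDA, and pick_device is a hypothetical helper name. How the resulting string is passed to the model depends on the repository's own inference code, which is not shown here.

```python
import importlib.util


def pick_device(preferred: str = "cuda:0") -> str:
    """Return `preferred` if PyTorch reports a usable CUDA device, else "cpu".

    Assumption: PyTorch is the backend. `pick_device` is a hypothetical
    helper for illustration, not part of Fun-ASR.
    """
    # If PyTorch is not installed, there is nothing to query: use the CPU.
    if importlib.util.find_spec("torch") is None:
        return "cpu"

    import torch  # imported lazily so the sketch also runs without PyTorch

    # torch.cuda.is_available() is False when no CUDA device or driver exists.
    return preferred if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    print(f"Selected inference device: {pick_device()}")
```

Falling back to "cpu" keeps the same script usable on machines without a GPU, at the cost of slower inference.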

Highlighted Details

  • Supports 31 languages, with extensive coverage of Chinese dialects (7) and regional accents (26).
  • Achieves up to 93% accuracy in far-field, high-noise environments.
  • Includes specialized modules for lyric recognition over background music and for rap speech.
  • Offers low-latency real-time transcription capabilities.

Maintenance & Community

Community interaction and online demos are available through the ModelScope Community Space and Hugging Face Spaces. The project is associated with Tongyi Lab.

Licensing & Compatibility

The provided README does not specify the software license. This omission requires clarification for commercial use or integration into closed-source projects.

Limitations & Caveats

The project has outstanding TODO items including support for returning timestamps, speaker diarization, and model training. The current focus is primarily on inference.

Health Check

  • Last commit: 23 hours ago
  • Responsiveness: Inactive
  • Pull requests (30d): 1
  • Issues (30d): 12
  • Star history: 97 stars in the last 30 days
