Fun-ASR by FunAudioLLM

Advanced speech recognition toolkit for global audio

Created 3 weeks ago

672 stars

Top 50.4% on SourcePulse

Project Summary

Fun-ASR is an end-to-end large speech recognition model from Tongyi Lab, designed for high-precision, multilingual transcription. It targets developers and researchers who need robust ASR, especially in challenging acoustic environments or specialized domains, and it offers low-latency real-time transcription along with extensive dialect and accent support.

How It Works

The system uses an end-to-end architecture trained on tens of millions of hours of real speech data. It includes optimizations for far-field, high-noise scenarios, where it reaches up to 93% accuracy. Notable capabilities include in-depth support for 7 Chinese dialects and 26 regional accents, recognition of 31 languages with support for mixed-language speech, and improved transcription of lyrics over background music.

Quick Start & Requirements

To install, clone the repository (https://github.com/FunAudioLLM/Fun-ASR.git), change into its directory, and run pip install -r requirements.txt. GPU acceleration (e.g., device cuda:0) is recommended for inference. Online demos are available on ModelScope and Hugging Face Spaces.
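
The setup steps above can be sketched in Python as follows. This is a minimal illustration, not code from the project: install_commands and pick_device are hypothetical helper names, the checkout directory name is assumed to follow git's default (the URL's last component without ".git"), and the device check assumes inference runs on PyTorch, which the summary does not state. Nothing is executed against the network; the commands are only printed.

```python
import importlib.util
import shlex

REPO_URL = "https://github.com/FunAudioLLM/Fun-ASR.git"  # URL given in the summary


def install_commands(repo_url: str = REPO_URL) -> list[list[str]]:
    """Return the three setup steps as argument lists (nothing is executed)."""
    # Assumption: git's default checkout directory, i.e. the last URL
    # component with the ".git" suffix removed.
    repo_dir = repo_url.rstrip("/").rsplit("/", 1)[-1].removesuffix(".git")
    return [
        ["git", "clone", repo_url],
        ["cd", repo_dir],
        ["pip", "install", "-r", "requirements.txt"],
    ]


def pick_device(preferred: str = "cuda:0") -> str:
    """Return `preferred` when PyTorch reports a CUDA device, else "cpu".

    Assumption: the model runs on PyTorch; the summary only mentions
    the device string "cuda:0".
    """
    if importlib.util.find_spec("torch") is None:
        return "cpu"  # PyTorch is not installed, so CUDA cannot be checked
    import torch

    return preferred if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    for cmd in install_commands():
        print(shlex.join(cmd))
    print("device:", pick_device())
```

Falling back to "cpu" keeps the sketch usable on machines without a GPU, though the summary recommends GPU acceleration for inference.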

Highlighted Details

  • Supports 31 languages, with extensive coverage of Chinese dialects (7) and regional accents (26).
  • Achieves up to 93% accuracy in far-field, high-noise environments.
  • Includes specialized modules for recognizing lyrics over background music and rap speech.
  • Offers low-latency real-time transcription capabilities.

Maintenance & Community

Community interaction and online demos are hosted on the ModelScope Community Space and Hugging Face Spaces. The project is associated with Tongyi Lab.

Licensing & Compatibility

The README does not specify a software license. The licensing terms should be clarified with the maintainers before commercial use or integration into closed-source projects.

Limitations & Caveats

The project has outstanding TODO items including support for returning timestamps, speaker diarization, and model training. The current focus is primarily on inference.

Health Check

  • Last commit: 3 days ago
  • Responsiveness: Inactive
  • Pull requests (30d): 4
  • Issues (30d): 54
  • Star history: 679 stars in the last 27 days
