omnilingual-asr  by facebookresearch

Multilingual speech recognition for over 1,600 languages

Created 3 months ago
2,694 stars

Top 17.2% on SourcePulse

Project Summary

Omnilingual ASR is an open-source speech recognition system designed for broad accessibility. It supports over 1,600 languages, including hundreds that no ASR system had previously covered. It aims to make speech technology more inclusive and adaptable for communities and researchers worldwide by letting new languages be added with minimal data through scalable zero-shot learning.

How It Works

The system is a flexible model family spanning wav2vec (W2V) speech encoders, Connectionist Temporal Classification (CTC) models, and Large Language Model (LLM) based models. Its core innovation is scalable zero-shot learning: a new language can be added from only a few paired examples, without an extensive, specialized dataset.
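
To make the CTC part of the family more concrete, here is a minimal sketch of greedy CTC decoding in general: take the most likely label per audio frame, collapse consecutive repeats, then drop the blank symbol. This illustrates the technique only and is not taken from the project's code; the toy vocabulary, the blank index, and the function name `ctc_greedy_decode` are assumptions made for the example.

```python
from itertools import groupby
from typing import Sequence

BLANK = 0  # assumed index of the CTC blank symbol in this toy vocabulary


def ctc_greedy_decode(
    frame_ids: Sequence[int], vocab: Sequence[str], blank: int = BLANK
) -> str:
    """Collapse consecutive repeated frame labels, then remove blanks."""
    collapsed = [label for label, _ in groupby(frame_ids)]
    return "".join(vocab[i] for i in collapsed if i != blank)


# Hypothetical per-frame argmax labels for a short utterance.
vocab = ["<blank>", "c", "a", "t"]
frames = [0, 1, 1, 0, 2, 2, 2, 0, 0, 3, 3]
print(ctc_greedy_decode(frames, vocab))  # cat
```

A blank between two identical labels is what lets CTC emit a doubled character: the frames `[2, 0, 2]` decode to "aa", while `[2, 2]` decode to "a".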

Quick Start & Requirements

  • Installation: pip install omnilingual-asr or uv add omnilingual-asr.
  • Prerequisites: libsndfile is required for audio support (e.g., brew install libsndfile on macOS).
  • Links: Hugging Face Demo, Hugging Face Dataset (facebook/omnilingual-asr-corpus), Paper, Blogpost, Documentation, Quick Start, Inference Guide.

Highlighted Details

  • Supports over 1,600 languages, significantly expanding ASR coverage.
  • The 7B-LLM-ASR model achieves sub-10% character error rates (CER) for 78% of supported languages.
  • Offers a range of models (300M to 7B parameters) with varying VRAM and inference speed trade-offs.
  • Facilitates the addition of new languages with minimal paired examples.
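
For readers unfamiliar with the CER figure quoted above, the following sketch shows how character error rate is conventionally computed: the character-level edit distance (substitutions, deletions, insertions) between a reference transcript and the model output, divided by the reference length. This is the standard definition, not the project's evaluation script, which may normalize text differently; the function names are chosen for this example.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(
                min(
                    prev[j] + 1,             # delete a reference character
                    curr[j - 1] + 1,         # insert a hypothesis character
                    prev[j - 1] + (r != h),  # substitute, or match at no cost
                )
            )
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference transcript must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


print(cer("kitten", "sitting"))  # 0.5
```

Under this definition, "sub-10% CER" means fewer than one character edit per ten reference characters on average.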

Maintenance & Community

The project is attributed to the "Omnilingual ASR Team" with numerous listed authors. Specific community channels (e.g., Discord, Slack) or explicit roadmap links are not detailed in the provided README.

Licensing & Compatibility

The code and models are released under the Apache 2.0 license, which generally permits commercial use and integration into closed-source projects.

Limitations & Caveats

Currently, the inference pipeline only accepts audio files shorter than 40 seconds. Support for transcribing unlimited-length audio files is planned for a future release.
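
Given this limit, it can help to check clip length before sending files to the pipeline. The sketch below uses only Python's standard-library `wave` module, so it handles uncompressed WAV files only; the helper names and the `MAX_SECONDS` constant are assumptions for illustration and are not part of the omnilingual-asr package.

```python
import wave

MAX_SECONDS = 40.0  # limit quoted in this summary; confirm against the current README


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a WAV file: frame count divided by sample rate."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def fits_inference_limit(path: str, max_seconds: float = MAX_SECONDS) -> bool:
    """True if the clip is strictly shorter than the stated limit."""
    return wav_duration_seconds(path) < max_seconds
```

Files that fail the check would need to be split into shorter segments before transcription until longer inputs are supported.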

Health Check

  • Last commit: 1 month ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 7
  • Star history: 87 stars in the last 30 days
