Open-NLLB by gordicaleksa

Open-source effort for NLLB checkpoints, aiming for commercial use

Created 2 years ago
481 stars

Top 63.4% on SourcePulse

Project Summary

This repository provides open-source checkpoints and training code for Meta's NLLB (No Language Left Behind) machine translation system, which supports over 200 languages, with the stated goal of enabling commercial use. It targets researchers and developers who need high-quality multilingual translation, particularly for low-resource languages, and aims to democratize AI by offering freely usable models.

How It Works

The project builds on Meta's NLLB architecture, which comprises dense transformer models ranging from 600M to 3.3B parameters and a 54.5B-parameter Mixture-of-Experts (MoE) model. Text is encoded with a SentencePiece model (SPM-200) trained on 200+ languages, and the repository includes code for data mining, data preparation, training, and inference.
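
To make the size range above concrete, the following minimal sketch estimates how much memory the weights alone would occupy for each model size. The parameter counts come from this summary; the checkpoint labels, the `weight_memory_gb` and `largest_fitting` helpers, and the choice of 1 GB = 10^9 bytes are illustrative assumptions, not part of the repository. Activations, optimizer state, and framework overhead are not counted, so real requirements will be higher.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Checkpoint:
    name: str              # hypothetical label, not an official checkpoint name
    params_billion: float  # parameter count, in billions, as listed in the summary


# Sizes taken from the summary; ordered from smallest to largest.
CHECKPOINTS = [
    Checkpoint("dense-600M", 0.6),
    Checkpoint("dense-1.3B", 1.3),
    Checkpoint("dense-3.3B", 3.3),
    Checkpoint("moe-54.5B", 54.5),
]

# Bytes needed to store one parameter at a given precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2}


def weight_memory_gb(params_billion: float, dtype: str = "fp16") -> float:
    """Approximate size of the weights in GB (1 GB = 1e9 bytes)."""
    # (params_billion * 1e9 params) * bytes / 1e9 bytes-per-GB simplifies to this.
    return params_billion * BYTES_PER_PARAM[dtype]


def largest_fitting(budget_gb: float, dtype: str = "fp16") -> Optional[Checkpoint]:
    """Return the largest checkpoint whose weights fit in budget_gb, or None."""
    fitting = [c for c in CHECKPOINTS
               if weight_memory_gb(c.params_billion, dtype) <= budget_gb]
    return max(fitting, key=lambda c: c.params_billion) if fitting else None
```

Under these assumptions, the 3.3B dense model needs roughly 6.6 GB for fp16 weights, while the 54.5B MoE model needs roughly 109 GB, which is why the two families call for very different hardware.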

Quick Start & Requirements

  • Installation and usage instructions are detailed in the INSTALL guide and the fairseq README.
  • Requires Python and dependencies managed by fairseq.
  • Hugging Face integration is available for dense models.
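
As a rough illustration of the Hugging Face route mentioned above, the sketch below translates one sentence with the `transformers` library. It assumes `transformers` and PyTorch are installed and uses Meta's published `facebook/nllb-200-distilled-600M` checkpoint as a stand-in; which checkpoint this repository intends readers to load is not stated here, and that model falls under the CC-BY-NC 4.0 terms noted in the licensing section. The `is_flores_code` and `translate` helpers are hypothetical names introduced for this example.

```python
import re

# NLLB language tags follow the FLORES-200 shape: a three-letter language
# code, an underscore, and a four-letter script code (for example "eng_Latn").
_FLORES_CODE = re.compile(r"[a-z]{3}_[A-Z][a-z]{3}")


def is_flores_code(code: str) -> bool:
    """Check only the shape of a language tag, not whether NLLB supports it."""
    return _FLORES_CODE.fullmatch(code) is not None


def translate(text: str,
              src_lang: str = "eng_Latn",
              tgt_lang: str = "fra_Latn",
              model_name: str = "facebook/nllb-200-distilled-600M",  # assumed checkpoint
              max_length: int = 128) -> str:
    """Translate one string; downloads the model on first use."""
    for code in (src_lang, tgt_lang):
        if not is_flores_code(code):
            raise ValueError(f"not a FLORES-200 style language code: {code!r}")

    # Imported lazily so the validation helper works without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang=src_lang)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

    inputs = tokenizer(text, return_tensors="pt")
    # Forcing the first generated token to the target-language tag selects
    # the output language.
    generated = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
        max_length=max_length,
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]
```

Calling `translate("Hello, world!")` would fetch the checkpoint and return a French string; the exact output depends on the model version, so none is shown here.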

Highlighted Details

  • Open-sources NLLB-200 models (MoE 54.5B, Dense 3.3B, 1.3B) and distilled versions (1.3B, 600M).
  • Includes evaluation benchmarks: FLORES-200, NLLB-MD, Toxicity-200.
  • Provides code for LASER3 encoders and data mining pipelines.
  • Offers human evaluation guidelines and datasets (XSTS).

Maintenance & Community

  • Community engagement is encouraged via the "The AI Epiphany" Discord server and YouTube streams.
  • The project acknowledges language champions and data contributors.

Licensing & Compatibility

  • NLLB code and fairseq(-py) are MIT-licensed.
  • Note: The README states NLLB models are licensed under CC-BY-NC 4.0, which restricts commercial use. This contradicts the project's stated goal of releasing checkpoints for commercial purposes.

Limitations & Caveats

The primary caveat is the licensing conflict: the project aims to release checkpoints for commercial use, yet the README lists the NLLB models under CC-BY-NC 4.0, which prohibits commercial applications.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 2 stars in the last 30 days
