Open-NLLB by gordicaleksa

Open-source effort for NLLB checkpoints, aiming for commercial use

created 1 year ago

457 stars

Top 67.1% on sourcepulse

Project Summary

This repository is an open-source effort to provide checkpoints and training code for Meta's NLLB (No Language Left Behind) machine translation system, with the goal of permitting commercial use of models that support over 200 languages. It targets researchers and developers who need high-quality multilingual translation, particularly for low-resource languages, and aims to make such models freely usable.

How It Works

The project builds on Meta's NLLB architecture, which comes as dense transformer models ranging from 600M to 3.3B parameters and as a Mixture-of-Experts (MoE) model with 54.5B parameters. Text is encoded with a SentencePiece model (SPM-200) trained on 200+ languages, and the repository includes code for data mining, data preparation, training, and inference. Together these cover the full pipeline from raw data to translation across the supported languages.
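To make the Mixture-of-Experts idea concrete, here is a minimal sketch of how an MoE layer can route one token to a small subset of experts. It assumes top-2 gating in the style of GShard-like MoE layers. The summary above does not state the routing scheme or the number of experts, so the gating rule, the four toy experts, and every name in the code (`top2_route`, `moe_layer`, `gate_weights`) are illustrative assumptions, not the project's implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top2_route(gate_logits):
    """Pick the two highest-scoring experts and renormalize their weights.

    Returns a list of (expert_index, weight) pairs whose weights sum to 1.
    """
    probs = softmax(gate_logits)
    top2 = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:2]
    norm = probs[top2[0]] + probs[top2[1]]
    return [(i, probs[i] / norm) for i in top2]


def moe_layer(token, gate_weights, experts):
    """Apply a toy MoE layer to one token vector.

    gate_weights: one weight vector per expert (hypothetical gating matrix).
    experts: list of callables mapping a vector to a vector of the same size.
    Only the two selected experts are evaluated; the rest are skipped.
    """
    gate_logits = [sum(w * x for w, x in zip(row, token)) for row in gate_weights]
    routes = top2_route(gate_logits)
    out = [0.0] * len(token)
    for idx, weight in routes:
        expert_out = experts[idx](token)
        out = [o + weight * e for o, e in zip(out, expert_out)]
    return out, routes


# Toy setup: four "experts" that simply scale their input by 1, 2, 3, 4.
experts = [lambda v, s=s: [s * x for x in v] for s in (1.0, 2.0, 3.0, 4.0)]
gate_weights = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [-1.0, 0.0]]

if __name__ == "__main__":
    output, routes = moe_layer([1.0, 0.0], gate_weights, experts)
    print("selected experts and weights:", routes)
    print("layer output:", output)
```

Because only two experts run per token, the compute per token stays close to that of a dense layer even though the total parameter count is much larger.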

Quick Start & Requirements

Installation and usage instructions are detailed in the INSTALL guide and the fairseq README.
Requires Python and dependencies managed by fairseq.
Hugging Face integration is available for dense models.
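As a rough illustration of the Hugging Face route for dense models, the sketch below loads a dense NLLB checkpoint through the `transformers` library and translates one sentence. It makes several assumptions not stated above: that `transformers` and PyTorch are installed, that the checkpoint is Meta's `facebook/nllb-200-distilled-600M` from the Hugging Face Hub (released under CC-BY-NC 4.0, so not one of this project's own checkpoints), and that languages are given as FLORES-200 style codes such as `eng_Latn`. The helper names `is_flores_code` and `translate` are hypothetical.

```python
import re

# FLORES-200 style codes look like "eng_Latn": a three-letter language code,
# an underscore, and a four-letter script code.
_FLORES_CODE = re.compile(r"^[a-z]{3}_[A-Z][a-z]{3}$")


def is_flores_code(code: str) -> bool:
    """Check only the shape of a language code, not whether the model knows it."""
    return bool(_FLORES_CODE.match(code))


def translate(
    text: str,
    src_lang: str = "eng_Latn",
    tgt_lang: str = "fra_Latn",
    model_name: str = "facebook/nllb-200-distilled-600M",  # assumed checkpoint
    max_length: int = 128,
) -> str:
    """Translate one sentence with a dense NLLB checkpoint (downloads weights)."""
    for code in (src_lang, tgt_lang):
        if not is_flores_code(code):
            raise ValueError(f"not a FLORES-200 style code: {code!r}")

    # Imported lazily so the helper above works without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang=src_lang)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

    inputs = tokenizer(text, return_tensors="pt")
    # The target language is selected by forcing its code as the first token.
    generated = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
        max_length=max_length,
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]


if __name__ == "__main__":
    print(translate("Machine translation should work for every language."))
```

The first call downloads the model weights, so it needs network access and a few gigabytes of disk space.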

Highlighted Details

Open-sources NLLB-200 models (MoE 54.5B, Dense 3.3B, 1.3B) and distilled versions (1.3B, 600M).
Includes evaluation benchmarks: FLORES-200, NLLB-MD, Toxicity-200.
Provides code for LASER3 encoders and data mining pipelines.
Offers human evaluation guidelines and datasets (XSTS).

Maintenance & Community

Community engagement is encouraged via The AI Epiphany Discord server and YouTube streams.
The project acknowledges language champions and data contributors.

Licensing & Compatibility

NLLB code and fairseq(-py) are MIT-licensed.
Note: The README states NLLB models are licensed under CC-BY-NC 4.0, which restricts commercial use. This contradicts the project's stated goal of releasing checkpoints for commercial purposes.
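The licensing conflict above can be expressed as a small check: a deployment is commercially usable only if every component it depends on permits commercial use. The sketch below is a simplified, non-legal illustration. The license table covers only the two licenses named in this section, unknown licenses are conservatively treated as blocking, and the names `LICENSE_ALLOWS_COMMERCIAL` and `can_use_commercially` are hypothetical.

```python
# Assumption: a simplified, non-legal view of the two licenses named above.
LICENSE_ALLOWS_COMMERCIAL = {
    "MIT": True,
    "CC-BY-NC-4.0": False,  # "NC" means NonCommercial
}


def can_use_commercially(component_licenses):
    """Return (ok, blockers) for a mapping of component name to license id.

    A component blocks commercial use if its license is non-commercial or
    is not listed in the table (unknown licenses are treated as blocking).
    """
    blockers = sorted(
        name
        for name, license_id in component_licenses.items()
        if not LICENSE_ALLOWS_COMMERCIAL.get(license_id, False)
    )
    return (not blockers, blockers)


if __name__ == "__main__":
    stack = {"NLLB code": "MIT", "fairseq": "MIT", "NLLB models": "CC-BY-NC-4.0"}
    ok, blockers = can_use_commercially(stack)
    print("commercial use allowed:", ok)
    print("blocking components:", blockers)
```

Under these assumptions the code alone passes the check, while any stack that includes the CC-BY-NC 4.0 model weights does not.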

Limitations & Caveats

The primary caveat is the licensing conflict: the project aims for commercial use, but the README states that the NLLB models are under CC-BY-NC 4.0, which prohibits commercial use.

Health Check

Last commit: 1 year ago

Responsiveness: 1 day

Star History

16 stars in the last 90 days