IndicTrans2 by AI4Bharat

Multilingual NMT model for 22 Indian languages

created 2 years ago
343 stars

Top 81.8% on sourcepulse

Project Summary

IndicTrans2 provides open-source, transformer-based multilingual NMT models covering all 22 scheduled Indian languages, addressing the shortage of high-quality translation for low-resource Indic languages. It targets researchers and developers building Indic language technologies.

How It Works

IndicTrans2 applies script unification wherever feasible: related Indic scripts are mapped to a common script so that lexically similar languages share subword vocabulary, enabling transfer learning through lexical sharing. After unification, the models operate on five scripts: Perso-Arabic, Ol Chiki, Meitei, Latin, and Devanagari. The models are trained on the comprehensive BPCC dataset and augmented with back-translation data.
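The script-unification idea can be sketched in a few lines. This is an illustrative toy, not code from the IndicTrans2 repository (which relies on a full transliteration library): it exploits the fact that Unicode lays out the major Brahmi-derived Indic scripts in parallel 128-codepoint blocks (Devanagari U+0900–097F, Bengali U+0980–09FF, ...), so a plain offset shift approximates mapping a script onto Devanagari.

```python
# Hedged sketch of script unification via Unicode block offsets.
# Real systems handle script-specific exceptions; this is an approximation.

DEVANAGARI_START = 0x0900

# Start codepoints of a few parallel Indic script blocks (illustrative subset).
SCRIPT_BLOCK_START = {
    "bengali": 0x0980,
    "gujarati": 0x0A80,
    "tamil": 0x0B80,
}

def unify_to_devanagari(text: str, script: str) -> str:
    """Map characters of `script` onto the parallel Devanagari block,
    leaving everything else (spaces, punctuation, digits) unchanged."""
    start = SCRIPT_BLOCK_START[script]
    out = []
    for ch in text:
        cp = ord(ch)
        if start <= cp < start + 0x80:
            out.append(chr(DEVANAGARI_START + (cp - start)))
        else:
            out.append(ch)
    return "".join(out)

# Bengali letter KA (U+0995) maps to Devanagari letter KA (U+0915).
print(unify_to_devanagari("\u0995", "bengali"))  # क
```

After unification, a Bengali and a Hindi sentence with shared vocabulary produce overlapping subword tokens, which is what makes cross-lingual transfer cheap.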

Quick Start & Requirements

  • Install dependencies: source install.sh
  • Requires Python >= 3.7, sentencepiece, and GNU parallel.
  • A quick-start guide and full documentation are available in the repository.

Highlighted Details

  • Supports 22 scheduled Indian languages, including multiple scripts for low-resource languages.
  • Releases training dataset (BPCC), back-translation data (BPCC-BT), models, and evaluation benchmarks (IN22).
  • Offers Fairseq and Hugging Face (HF) compatible models.
  • Long-context variants (supporting sequences up to 2048 tokens) are available.
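IndicTrans2 identifies languages with FLORES-style codes such as `eng_Latn` and `hin_Deva`, and its preprocessing prepends source- and target-language tags to each input sentence. The helper below is a hypothetical illustration of that tagging convention, not a function from the released toolkit:

```python
# Hedged sketch: build the tagged source line used by the preprocessing
# pipeline, with FLORES-style language codes (e.g. "eng_Latn", "hin_Deva").
# `tag_source_sentence` is an illustrative helper, not a repository API.

def tag_source_sentence(sentence: str, src_lang: str, tgt_lang: str) -> str:
    """Prepend source- and target-language tags to a sentence."""
    return f"{src_lang} {tgt_lang} {sentence}"

print(tag_source_sentence("Hello, world!", "eng_Latn", "hin_Deva"))
# eng_Latn hin_Deva Hello, world!
```

The target tag tells the shared multilingual decoder which language and script to generate, which is how one model serves all 22 languages.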

Maintenance & Community

  • Active development by AI4Bharat.
  • Links to paper, website, and demo are provided.

Licensing & Compatibility

  • Model checkpoints are released under the MIT license, permitting commercial use.
  • Training data has mixed licenses: CC0 for mined/back-translation data, CC-BY-4.0 for newly added seed corpora and evaluation sets.

Limitations & Caveats

The README notes that the tokenizer for the HF-compatible models has been migrated to IndicTransToolkit and will be maintained separately, so existing integrations that import the tokenizer from this repository may need updating.

Health Check

  • Last commit: 3 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star history: 34 stars in the last 90 days
