IndicTrans2 by AI4Bharat

Multilingual NMT model for 22 Indian languages

created 2 years ago
343 stars

Top 81.8% on sourcepulse

Project Summary

IndicTrans2 provides open-source, transformer-based multilingual NMT models covering all 22 scheduled Indian languages, addressing the shortage of high-quality translation for low-resource Indic languages. It targets researchers and developers building Indic language technologies.

How It Works

IndicTrans2 applies script unification wherever feasible: related Indic scripts are mapped to a common script so that lexically similar languages share subword vocabulary, enabling transfer learning through lexical sharing. After unification, the models operate on five scripts: Perso-Arabic, Ol Chiki, Meitei, Latin, and Devanagari. The models are trained on the comprehensive BPCC dataset and augmented with back-translation data.
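The script-unification idea can be sketched in a few lines. This is an illustrative toy, not code from the IndicTrans2 repository (which relies on a full transliteration library): it exploits the fact that Unicode lays out the major Brahmi-derived Indic scripts in parallel 128-codepoint blocks (Devanagari U+0900–097F, Bengali U+0980–09FF, ...), so a plain offset shift approximates mapping a script onto Devanagari.

```python
# Hedged sketch of script unification via Unicode block offsets.
# Real systems handle script-specific exceptions; this is an approximation.

DEVANAGARI_START = 0x0900

# Start codepoints of a few parallel Indic script blocks (illustrative subset).
SCRIPT_BLOCK_START = {
    "bengali": 0x0980,
    "gujarati": 0x0A80,
    "tamil": 0x0B80,
}

def unify_to_devanagari(text: str, script: str) -> str:
    """Map characters of `script` onto the parallel Devanagari block,
    leaving everything else (spaces, punctuation, digits) unchanged."""
    start = SCRIPT_BLOCK_START[script]
    out = []
    for ch in text:
        cp = ord(ch)
        if start <= cp < start + 0x80:
            out.append(chr(DEVANAGARI_START + (cp - start)))
        else:
            out.append(ch)
    return "".join(out)

# Bengali letter KA (U+0995) maps to Devanagari letter KA (U+0915).
print(unify_to_devanagari("\u0995", "bengali"))  # क
```

After unification, a Bengali and a Hindi sentence with shared vocabulary produce overlapping subword tokens, which is what makes cross-lingual transfer cheap.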

Quick Start & Requirements

  • Install dependencies: source install.sh
  • Requires Python >= 3.7, sentencepiece, and GNU parallel.
  • A quick-start guide and full documentation are available in the repository.

Highlighted Details

  • Supports 22 scheduled Indian languages, including multiple scripts for low-resource languages.
  • Releases training dataset (BPCC), back-translation data (BPCC-BT), models, and evaluation benchmarks (IN22).
  • Offers Fairseq and Hugging Face (HF) compatible models.
  • Long-context variants (supporting sequences up to 2048 tokens) are available.
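IndicTrans2 identifies languages with FLORES-style codes such as `eng_Latn` and `hin_Deva`, and its preprocessing prepends source- and target-language tags to each input sentence. The helper below is a hypothetical illustration of that tagging convention, not a function from the released toolkit:

```python
# Hedged sketch: build the tagged source line used by the preprocessing
# pipeline, with FLORES-style language codes (e.g. "eng_Latn", "hin_Deva").
# `tag_source_sentence` is an illustrative helper, not a repository API.

def tag_source_sentence(sentence: str, src_lang: str, tgt_lang: str) -> str:
    """Prepend source- and target-language tags to a sentence."""
    return f"{src_lang} {tgt_lang} {sentence}"

print(tag_source_sentence("Hello, world!", "eng_Latn", "hin_Deva"))
# eng_Latn hin_Deva Hello, world!
```

The target tag tells the shared multilingual decoder which language and script to generate, which is how one model serves all 22 languages.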

Maintenance & Community

  • Active development by AI4Bharat.
  • Links to paper, website, and demo are provided.

Licensing & Compatibility

  • Model checkpoints are released under the MIT license, permitting commercial use.
  • Training data has mixed licenses: CC0 for mined/back-translation data, CC-BY-4.0 for newly added seed corpora and evaluation sets.

Limitations & Caveats

The README notes that the tokenizer for the HF-compatible models has been migrated to IndicTransToolkit and will be maintained separately, so existing integrations that import the tokenizer from this repository may need updating.

Health Check

  • Last commit: 3 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star history: 34 stars in the last 90 days
