Multilingual NMT model for 22 Indian languages
IndicTrans2 provides open-source transformer-based multilingual NMT models for all 22 scheduled Indian languages, addressing the need for high-quality translation in low-resource Indic languages. It is designed for researchers and developers working on Indic language technologies.
How It Works
IndicTrans2 applies script unification wherever feasible, so that related languages share vocabulary and the model can transfer what it learns across them. The models support five scripts: Perso-Arabic, Ol Chiki, Meitei, Latin, and Devanagari. They are trained on the Bharat Parallel Corpus Collection (BPCC) and augmented with back-translation data.
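To make the idea of script unification concrete, here is a minimal sketch that maps text from several Brahmic scripts onto Devanagari. It relies on the fact that these Unicode blocks are laid out in parallel, so shifting a code point by the difference between block starts usually yields the corresponding character. The function name `to_devanagari` and the table `BLOCK_START` are hypothetical names introduced for this illustration; this is not the IndicTrans2 preprocessing code, and a real pipeline would need to handle script-specific exceptions that a plain offset ignores.

```python
# Illustrative sketch only: offset-based mapping between parallel Unicode blocks.
# BLOCK_START and to_devanagari are hypothetical names, not part of IndicTrans2.

# First code point of each script's Unicode block (each block spans 128 code points).
BLOCK_START = {
    "devanagari": 0x0900,
    "bengali": 0x0980,
    "gurmukhi": 0x0A00,
    "gujarati": 0x0A80,
    "oriya": 0x0B00,
    "tamil": 0x0B80,
    "telugu": 0x0C00,
    "kannada": 0x0C80,
    "malayalam": 0x0D00,
}
BLOCK_SIZE = 0x80


def to_devanagari(text: str, source_script: str) -> str:
    """Shift every character in the source script's block into the Devanagari block."""
    start = BLOCK_START[source_script]
    offset = BLOCK_START["devanagari"] - start
    converted = []
    for ch in text:
        cp = ord(ch)
        if start <= cp < start + BLOCK_SIZE:
            # Character belongs to the source script: move it to the same slot in Devanagari.
            converted.append(chr(cp + offset))
        else:
            # Spaces, digits, Latin letters, and punctuation pass through unchanged.
            converted.append(ch)
    return "".join(converted)


if __name__ == "__main__":
    # Bengali KA (U+0995) and Gujarati KA (U+0A95) both land on Devanagari KA (U+0915).
    print(to_devanagari("\u0995", "bengali") == to_devanagari("\u0A95", "gujarati"))
```

Once several languages are written in one script, cognates and shared loanwords tend to produce the same subword tokens, which is the lexical sharing that the transfer-learning claim rests on.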
Quick Start & Requirements
source install.sh
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
According to the README, the tokenizer for the Hugging Face (HF) compatible models has been migrated to IndicTransToolkit and will be maintained separately.