Multilingual ALBERT model for Indian languages
This repository provides Indic-BERT, an ALBERT-based multilingual language model pre-trained on 11 Indian languages and English. It aims to offer competitive performance with significantly fewer parameters than other multilingual models, targeting researchers and developers working with Indic languages. The project also introduces IndicGLUE, a benchmark suite for evaluating Natural Language Understanding (NLU) tasks in these languages.
How It Works
Indic-BERT is built upon the ALBERT architecture, a parameter-efficient variant of BERT. It leverages a novel corpus of approximately 9 billion tokens spanning 12 languages. The model's advantage lies in its specialized training on Indic languages and its significantly reduced parameter count, enabling more efficient deployment and fine-tuning.
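To see the parameter savings concretely, here is a minimal sketch (assuming the ai4bharat/indic-bert checkpoint on the Hugging Face Hub, the same one used in the Quick Start below) that loads the model and counts its weights:

from transformers import AutoModel

model = AutoModel.from_pretrained('ai4bharat/indic-bert')

# ALBERT ties one set of encoder weights across all layers and factorizes
# the embedding matrix, which is where most of the savings come from.
total = sum(p.numel() for p in model.parameters())
print(f'{total / 1e6:.1f}M parameters')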
Quick Start & Requirements
Install the dependencies, then load the model and tokenizer through Hugging Face Transformers:

pip3 install transformers sentencepiece

from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('ai4bharat/indic-bert', keep_accents=True)
model = AutoModel.from_pretrained('ai4bharat/indic-bert')

Pass keep_accents=True when creating the tokenizer: the underlying ALBERT tokenizer strips accents by default, which corrupts Indic-script text.
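Once loaded, the model behaves like any other Transformers encoder. A minimal sketch reusing the tokenizer and model from above (the Hindi sample sentence is illustrative):

import torch

# Tokenize a sentence and run it through the encoder.
inputs = tokenizer('भारत एक विशाल देश है', return_tensors='pt')

with torch.no_grad():
    outputs = model(**inputs)

# Contextual token embeddings: (batch, sequence_length, hidden_size)
print(outputs.last_hidden_state.shape)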
To run the evaluation and fine-tuning code, clone the repository and install its requirements:

git clone https://github.com/AI4Bharat/indic-bert && cd indic-bert && sudo pip3 install -r requirements.txt

The fine-tuning code additionally depends on pytorch-xla.
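The repository ships its own fine-tuning scripts; purely as an illustration, here is a minimal single-step fine-tuning sketch using the standard Transformers classification head (the sentences, labels, and num_labels value are placeholders, not from the repo):

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('ai4bharat/indic-bert', keep_accents=True)
model = AutoModelForSequenceClassification.from_pretrained('ai4bharat/indic-bert', num_labels=2)

# Toy batch: two sentences with binary labels (placeholder data).
batch = tokenizer(['यह फिल्म अच्छी है', 'यह फिल्म खराब है'],
                  return_tensors='pt', padding=True, truncation=True, max_length=128)
labels = torch.tensor([1, 0])

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# One manual training step; real fine-tuning would loop over a DataLoader.
model.train()
loss = model(**batch, labels=labels).loss
loss.backward()
optimizer.step()
print(f'loss: {loss.item():.3f}')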
Maintenance & Community
This project is part of the AI4Bharat initiative, a volunteer effort. Key contributors are listed, and contact information is provided for feedback. The README recommends using the newer IndicBERT v2 repository for the latest improvements.
Licensing & Compatibility
The code and models are released under the MIT License, permitting commercial use and integration with closed-source projects.
Limitations & Caveats
The README explicitly recommends the newer IndicBERT v2 repository for improved performance and implementation, so this v1 repository should be considered legacy. All models are restricted to a max_seq_length of 128.
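Inputs longer than this limit must be truncated at tokenization time. A minimal sketch, reusing the tokenizer from the Quick Start (the input string is a placeholder):

# Truncate and pad inputs to the model's 128-token limit.
inputs = tokenizer('... a long document ...', return_tensors='pt',
                   truncation=True, max_length=128, padding='max_length')

print(inputs['input_ids'].shape)  # torch.Size([1, 128])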