Indic-BERT-v1 by AI4Bharat

Multilingual ALBERT model for Indian languages

created 5 years ago
287 stars

Project Summary

This repository provides Indic-BERT, an ALBERT-based multilingual language model pre-trained on 11 Indian languages and English. It aims to offer competitive performance with significantly fewer parameters than other multilingual models, targeting researchers and developers working with Indic languages. The project also introduces IndicGLUE, a benchmark suite for evaluating Natural Language Understanding (NLU) tasks in these languages.

How It Works

Indic-BERT is built on the ALBERT architecture, a parameter-efficient variant of BERT that shares weights across layers. It was pre-trained on a novel corpus of approximately 9 billion tokens spanning 12 languages: 11 Indian languages plus English. Its advantage lies in specialized training on Indic-language text combined with a much smaller parameter count, enabling cheaper deployment and fine-tuning.
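
A minimal sketch of the parameter-count gap, assuming the 'ai4bharat/indic-bert' checkpoint used in the Quick Start below and, purely for comparison, the public 'bert-base-multilingual-cased' (mBERT) checkpoint:

    from transformers import AutoModel

    # Load IndicBERT (ALBERT-based) and mBERT for a parameter-count comparison.
    indic_bert = AutoModel.from_pretrained('ai4bharat/indic-bert')
    mbert = AutoModel.from_pretrained('bert-base-multilingual-cased')

    def count_params(model):
        # Sum of all parameter tensor sizes in the model.
        return sum(p.numel() for p in model.parameters())

    print(f"indic-bert: ~{count_params(indic_bert) / 1e6:.0f}M parameters")
    print(f"mBERT:      ~{count_params(mbert) / 1e6:.0f}M parameters")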

Quick Start & Requirements

  • Install via pip: pip3 install transformers sentencepiece
  • Load model and tokenizer using Hugging Face transformers:
    from transformers import AutoModel, AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained('ai4bharat/indic-bert')
    model = AutoModel.from_pretrained('ai4bharat/indic-bert')
    
  • To preserve accents (combining marks) during tokenization, pass keep_accents=True to the tokenizer; see the sketch after this list.
  • Code can be run on GPU, TPU, or Google Colab; a Colab fine-tuning notebook is available.
  • Full setup requires cloning the repo and installing requirements: git clone https://github.com/AI4Bharat/indic-bert && cd indic-bert && sudo pip3 install -r requirements.txt
  • TPU setup requires specific environment variables and pytorch-xla.
  • Official documentation and downloads are available via the IndicBERT Website.
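
A minimal end-to-end sketch combining the steps above, assuming PyTorch is installed alongside transformers and sentencepiece; the Hindi sample sentence is purely illustrative:

    from transformers import AutoModel, AutoTokenizer

    # keep_accents=True stops the ALBERT tokenizer from stripping combining
    # marks (matras), which carry meaning in Indic scripts.
    tokenizer = AutoTokenizer.from_pretrained('ai4bharat/indic-bert', keep_accents=True)
    model = AutoModel.from_pretrained('ai4bharat/indic-bert')

    inputs = tokenizer("यह एक उदाहरण वाक्य है।", return_tensors='pt')
    outputs = model(**inputs)

    # Contextual embeddings of shape (batch, seq_len, hidden_size).
    print(outputs.last_hidden_state.shape)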

Highlighted Details

  • Pre-trained on ~9 billion tokens covering 11 Indian languages and English (12 languages in total).
  • Achieves performance comparable to or better than larger multilingual models while using roughly 10x fewer parameters.
  • Introduces IndicGLUE, a benchmark suite of 5 NLU tasks (News Category Classification, Named Entity Recognition, Headline Prediction, Wikipedia Section Title Prediction, Cloze-style QA) across 11 Indian languages; a loading sketch follows this list.
  • Provides evaluation results against mBERT and XLM-R on IndicGLUE and additional tasks.
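
A hedged sketch of pulling one IndicGLUE task with the Hugging Face datasets library; the Hub id 'indic_glue' and the config name 'sna.bn' (Soham News Articles, Bengali) are assumptions about how the benchmark is published, not code from this repository:

    from datasets import load_dataset

    # Assumed Hub id and config; check the IndicGLUE dataset card for the
    # full list of task/language configurations.
    sna = load_dataset('indic_glue', 'sna.bn')

    print(sna)              # available splits and sizes
    print(sna['train'][0])  # a single labeled news article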

Maintenance & Community

This project is part of the AI4Bharat initiative, a volunteer effort. Key contributors are listed, and contact information is provided for feedback. The README recommends using the newer IndicBERT v2 repository for the latest improvements.

Licensing & Compatibility

The code and models are released under the MIT License, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

The README explicitly recommends the newer IndicBERT v2 repository for improved performance and implementation, so this v1 repository should be treated as legacy. All models are restricted to a maximum sequence length (max_seq_length) of 128 tokens.
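
Given that limit, longer inputs should be truncated at tokenization time; a minimal sketch, reusing the tokenizer from the Quick Start (the placeholder long_text stands in for any real document):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained('ai4bharat/indic-bert', keep_accents=True)

    long_text = "..."  # placeholder for any document longer than 128 tokens

    # Clip to the model's 128-token limit; padding makes fixed-size batches.
    inputs = tokenizer(
        long_text,
        max_length=128,
        truncation=True,
        padding='max_length',
        return_tensors='pt',
    )
    print(inputs['input_ids'].shape)  # torch.Size([1, 128])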

Health Check

  • Last commit: 2 years ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 4 stars in the last 90 days
