Indic-BERT-v1 by AI4Bharat

Multilingual ALBERT model for Indian languages

created 5 years ago
287 stars

Project Summary

This repository provides Indic-BERT, an ALBERT-based multilingual language model pre-trained on 11 Indian languages and English. It aims to offer competitive performance with significantly fewer parameters than other multilingual models, targeting researchers and developers working with Indic languages. The project also introduces IndicGLUE, a benchmark suite for evaluating Natural Language Understanding (NLU) tasks in these languages.

How It Works

Indic-BERT is built on the ALBERT architecture, a parameter-efficient variant of BERT that shares weights across layers. It was pre-trained on a novel corpus of approximately 9 billion tokens spanning 12 languages: 11 Indian languages plus English. Its advantage lies in specialized training on Indic-language text combined with a much smaller parameter count, enabling cheaper deployment and fine-tuning.
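
A minimal sketch of the parameter-count gap, assuming the 'ai4bharat/indic-bert' checkpoint used in the Quick Start below and, purely for comparison, the public 'bert-base-multilingual-cased' (mBERT) checkpoint:

    from transformers import AutoModel

    # Load IndicBERT (ALBERT-based) and mBERT for a parameter-count comparison.
    indic_bert = AutoModel.from_pretrained('ai4bharat/indic-bert')
    mbert = AutoModel.from_pretrained('bert-base-multilingual-cased')

    def count_params(model):
        # Sum of all parameter tensor sizes in the model.
        return sum(p.numel() for p in model.parameters())

    print(f"indic-bert: ~{count_params(indic_bert) / 1e6:.0f}M parameters")
    print(f"mBERT:      ~{count_params(mbert) / 1e6:.0f}M parameters")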

Quick Start & Requirements

  • Install via pip: pip3 install transformers sentencepiece
  • Load model and tokenizer using Hugging Face transformers:
    from transformers import AutoModel, AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained('ai4bharat/indic-bert')
    model = AutoModel.from_pretrained('ai4bharat/indic-bert')
    
  • To preserve accents (combining marks) during tokenization, pass keep_accents=True to the tokenizer; see the sketch after this list.
  • Code can be run on GPU, TPU, or Google Colab; a Colab fine-tuning notebook is available.
  • Full setup requires cloning the repo and installing requirements: git clone https://github.com/AI4Bharat/indic-bert && cd indic-bert && sudo pip3 install -r requirements.txt
  • TPU setup requires specific environment variables and pytorch-xla.
  • Official documentation and downloads are available via the IndicBERT Website.
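
A minimal end-to-end sketch combining the steps above, assuming PyTorch is installed alongside transformers and sentencepiece; the Hindi sample sentence is purely illustrative:

    from transformers import AutoModel, AutoTokenizer

    # keep_accents=True stops the ALBERT tokenizer from stripping combining
    # marks (matras), which carry meaning in Indic scripts.
    tokenizer = AutoTokenizer.from_pretrained('ai4bharat/indic-bert', keep_accents=True)
    model = AutoModel.from_pretrained('ai4bharat/indic-bert')

    inputs = tokenizer("यह एक उदाहरण वाक्य है।", return_tensors='pt')
    outputs = model(**inputs)

    # Contextual embeddings of shape (batch, seq_len, hidden_size).
    print(outputs.last_hidden_state.shape)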

Highlighted Details

  • Pre-trained on ~9 billion tokens covering 11 Indian languages and English (12 languages in total).
  • Achieves performance comparable to or better than larger multilingual models while using roughly 10x fewer parameters.
  • Introduces IndicGLUE, a benchmark suite of 5 NLU tasks (News Category Classification, Named Entity Recognition, Headline Prediction, Wikipedia Section Title Prediction, Cloze-style QA) across 11 Indian languages; a loading sketch follows this list.
  • Provides evaluation results against mBERT and XLM-R on IndicGLUE and additional tasks.
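
A hedged sketch of pulling one IndicGLUE task with the Hugging Face datasets library; the Hub id 'indic_glue' and the config name 'sna.bn' (Soham News Articles, Bengali) are assumptions about how the benchmark is published, not code from this repository:

    from datasets import load_dataset

    # Assumed Hub id and config; check the IndicGLUE dataset card for the
    # full list of task/language configurations.
    sna = load_dataset('indic_glue', 'sna.bn')

    print(sna)              # available splits and sizes
    print(sna['train'][0])  # a single labeled news article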

Maintenance & Community

This project is part of the AI4Bharat initiative, a volunteer effort. Key contributors are listed, and contact information is provided for feedback. The README recommends using the newer IndicBERT v2 repository for the latest improvements.

Licensing & Compatibility

The code and models are released under the MIT License, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

The README explicitly recommends the newer IndicBERT v2 repository for improved performance and implementation, so this v1 repository should be treated as legacy. All models are restricted to a maximum sequence length (max_seq_length) of 128 tokens.
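
Given that limit, longer inputs should be truncated at tokenization time; a minimal sketch, reusing the tokenizer from the Quick Start (the placeholder long_text stands in for any real document):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained('ai4bharat/indic-bert', keep_accents=True)

    long_text = "..."  # placeholder for any document longer than 128 tokens

    # Clip to the model's 128-token limit; padding makes fixed-size batches.
    inputs = tokenizer(
        long_text,
        max_length=128,
        truncation=True,
        padding='max_length',
        return_tensors='pt',
    )
    print(inputs['input_ids'].shape)  # torch.Size([1, 128])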

Health Check

  • Last commit: 2 years ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 4 stars in the last 90 days
