IndicLLMSuite by AI4Bharat

Creating large-scale datasets for Indic language LLMs

Created 2 years ago

411 stars

Top 70.6% on SourcePulse

Project Summary

AI4Bharat's IndicLLMSuite provides a blueprint and comprehensive datasets for pre-training and fine-tuning large language models (LLMs) in 22 Indic languages. It addresses the lack of high-quality, large-scale data resources for these languages, with the aim of advancing LLM development and accessibility within India.

How It Works

The suite comprises two primary dataset collections: Sangraha, a pre-training corpus of 251 billion tokens, and IndicAlign, an instruction fine-tuning dataset of 74.7 million prompt-response pairs. Sangraha is curated from verified web scrapes, OCR'd PDFs, transcribed media, existing multilingual corpora, and synthetic translations. IndicAlign includes instruction-following and toxicity-alignment data, produced through aggregation, translation, synthetic generation, and crowd-sourcing. Three pipelines handle data curation: Setu for cleaning, filtering, and deduplication; Setu-translate for large-scale, structure-preserving translation; and Setu-transliterate for transliteration. Combining these sources and pipelines is intended to deliver both data quality and linguistic diversity.
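The clean, filter, and deduplicate stages described for Setu can be illustrated with a minimal single-machine sketch. This is not Setu's code: the function names, the thresholds, the Devanagari-only script check, and the use of exact-match hashing for deduplication are all assumptions chosen for illustration, and the real pipeline runs on Apache Spark.

```python
import hashlib
import unicodedata

# Hypothetical thresholds for illustration; not taken from Setu.
MIN_CHARS = 20
MIN_SCRIPT_RATIO = 0.5


def clean(text: str) -> str:
    """Normalize Unicode to NFC and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def devanagari_ratio(text: str) -> float:
    """Share of alphabetic characters in the Devanagari block (U+0900 to U+097F)."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum("\u0900" <= ch <= "\u097f" for ch in letters) / len(letters)


def keep(text: str) -> bool:
    """Filter stage: drop very short documents and documents in the wrong script."""
    return len(text) >= MIN_CHARS and devanagari_ratio(text) >= MIN_SCRIPT_RATIO


def dedup_exact(docs):
    """Deduplication stage: keep the first occurrence of each identical document."""
    seen = set()
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield doc


def curate(raw_docs):
    """Run clean -> filter -> deduplicate over an iterable of raw documents."""
    cleaned = (clean(d) for d in raw_docs)
    filtered = (d for d in cleaned if keep(d))
    return list(dedup_exact(filtered))
```

Calling `curate` on a list of raw strings returns the surviving documents in their original order. Exact hashing only catches identical text after normalization; a production pipeline would typically add near-duplicate detection as well.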

Quick Start & Requirements

Sangraha and IndicAlign are available for download from Hugging Face. The repository provides the code for the data curation pipelines (Setu, Setu-translate, Setu-transliterate), which depend on Apache Spark, IndicTrans2, and IndicXlit. Python versions and hardware requirements (e.g., GPU) are not specified, though large-scale data processing and any model training imply substantial compute. According to the README, setup instructions for the pipelines are available within the repository.
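Downloading from Hugging Face might look like the sketch below, which uses the `datasets` library's `load_dataset` function. The repository ID `ai4bharat/sangraha`, the `<subset>/<language code>` directory layout, and the example values `verified` and `hin` are assumptions not stated in this summary; check them against the dataset card before relying on them. The helper names are hypothetical.

```python
def sangraha_data_dir(subset: str, lang: str) -> str:
    """Build the data_dir argument, assuming a '<subset>/<lang>' layout (unverified)."""
    return f"{subset}/{lang}"


def load_sangraha(subset: str = "verified", lang: str = "hin", streaming: bool = True):
    """Load one Sangraha slice; streaming avoids downloading the full corpus up front."""
    # Third-party dependency, imported lazily: pip install datasets
    from datasets import load_dataset

    return load_dataset(
        "ai4bharat/sangraha",  # assumed repository ID
        data_dir=sangraha_data_dir(subset, lang),
        streaming=streaming,
    )
```

With `streaming=True`, records are fetched lazily as they are iterated, which is usually the practical choice for a corpus of this size.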

Highlighted Details

Received the ACL 2024 Outstanding Paper Award.
Offers the largest pre-training (Sangraha, 251B tokens) and instruction fine-tuning (IndicAlign, 74.7M pairs) dataset collections for 22 Indic languages, according to the project.
Includes specialized datasets like IndicAlign-Toxic for responsible AI alignment.
Features Setu, a comprehensive data cleaning, filtering, and deduplication pipeline specifically for Indic languages.

Maintenance & Community

The project is associated with AI4Bharat and lists multiple authors in its academic citation. The README provides no community channels (e.g., Discord, Slack), active maintainer information, or roadmap links.

Licensing & Compatibility

The repository does not specify a license. Until the maintainers clarify the terms, this omission is a significant barrier to adoption, particularly for commercial use or integration into closed-source projects.

Limitations & Caveats

Some supplementary resources and pipeline setup instructions are marked "[COMING SOON!!!]". The most critical caveat is the missing software license, which leaves usage rights and compatibility unclear. The project provides dataset-creation artifacts, not a runnable LLM or training framework.

Health Check

Last Commit

1 year ago

Responsiveness

Inactive

Star History

4 stars in the last 30 days