AI4Bharat: Creating large-scale datasets for Indic language LLMs
Top 92.6% on SourcePulse
Summary
AI4Bharat's IndicLLMSuite provides a blueprint and comprehensive datasets for pre-training and fine-tuning large language models (LLMs) in 22 Indic languages. It addresses the shortage of high-quality, large-scale data resources for these languages, with the aim of advancing LLM development and accessibility within India.
How It Works
The suite comprises two primary dataset collections: Sangraha, a 251-billion-token pre-training corpus, and IndicAlign, an instruction fine-tuning dataset of 74.7 million examples. Sangraha is curated from verified web scrapes, OCR'd PDFs, transcribed media, existing multilingual corpora, and synthetic translations. IndicAlign includes instruction-following and toxic alignment data, generated via aggregation, translation, synthetic methods, and crowd-sourcing. Data curation relies on three pipelines: Setu for cleaning, filtering, and deduplication; Setu-translate for large-scale, structure-preserving translations; and Setu-transliterate for transliterations. Combining these sources and pipelines is intended to support both data quality and linguistic diversity.
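The clean, filter, and deduplicate stages attributed to Setu can be illustrated with a small, self-contained sketch. This is not Setu's implementation (the summary indicates the real pipeline depends on Apache Spark); the function names, the minimum-word threshold, and the use of exact hash-based deduplication are all assumptions chosen for illustration.

```python
import hashlib
import re
import unicodedata


def clean(text: str) -> str:
    """Normalize Unicode to NFC and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()


def keep(text: str, min_words: int = 5) -> bool:
    """Hypothetical quality filter: drop documents with too few words."""
    return len(text.split()) >= min_words


def deduplicate(docs):
    """Exact deduplication: yield the first document seen for each content hash."""
    seen = set()
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield doc


def curate(raw_docs, min_words: int = 5):
    """Run clean -> filter -> deduplicate and return the surviving documents."""
    cleaned = (clean(d) for d in raw_docs)
    filtered = (d for d in cleaned if keep(d, min_words))
    return list(deduplicate(filtered))


# Illustrative input: two near-identical Hindi sentences (differing only in
# whitespace) and one document too short to pass the filter.
raw = [
    "यह   एक उदाहरण वाक्य है जो साफ किया जाएगा।",
    "यह एक उदाहरण वाक्य है जो साफ किया जाएगा।",
    "बहुत छोटा",
]
curated = curate(raw)
```

Cleaning runs before deduplication so that documents differing only in whitespace collapse to the same hash. A production pipeline would typically add language identification and near-duplicate detection, which this sketch omits.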
Quick Start & Requirements
The Sangraha and IndicAlign datasets are available for download from Hugging Face. The repository provides the code and pipelines (Setu, Setu-translate, Setu-transliterate) used for data curation. Setting up these pipelines requires dependencies such as Apache Spark, IndicTrans2, and IndicXlit. Python version and hardware requirements (e.g., GPU) are not documented, although large-scale data processing and any model training will likely need substantial compute. The README indicates that setup instructions for the data pipelines are available within the repository.
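As a rough sketch of how the data might be pulled from Hugging Face, the snippet below streams a few records with the `datasets` library rather than downloading the full corpus. The repository id `ai4bharat/sangraha`, the `<subset>/<lang>` directory layout, and the example values `verified` and `hin` are assumptions not stated in this summary; check the dataset card before relying on them.

```python
from itertools import islice


def sangraha_load_kwargs(subset: str, lang: str) -> dict:
    """Build keyword arguments for datasets.load_dataset.

    The repo id and the "<subset>/<lang>" layout are assumptions.
    """
    return {
        "path": "ai4bharat/sangraha",    # assumed Hub repository id
        "data_dir": f"{subset}/{lang}",  # assumed directory layout
        "split": "train",
        "streaming": True,               # iterate without a full download
    }


def stream_sangraha(subset: str = "verified", lang: str = "hin", limit: int = 3):
    """Return the first `limit` records; needs network access and `pip install datasets`."""
    from datasets import load_dataset  # imported lazily so the helper above works without it

    dataset = load_dataset(**sangraha_load_kwargs(subset, lang))
    return list(islice(dataset, limit))
```

Calling `stream_sangraha()` would fetch three records if the assumed repository id and layout are correct. Streaming is used here because a 251-billion-token corpus is impractical to download in full for a quick look.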
Maintenance & Community
The project is associated with AI4Bharat and lists multiple authors in its academic citation. The README provides no community channels (e.g., Discord, Slack), active maintainer information, or roadmap links.
Licensing & Compatibility
The repository does not specify a license. Until this is clarified, the omission is a significant barrier to adoption, particularly for commercial use or integration into closed-source projects.
Limitations & Caveats
Some supplementary resources and pipeline setup instructions are marked as "[COMING SOON!!!]". The most critical caveat is the absence of a defined software license, which leaves usage rights and compatibility unclear. The project provides datasets and data curation tooling rather than a runnable LLM or training framework.
1 year ago
Inactive