indicnlp_catalog by AI4Bharat

NLP resource catalog for Indic languages

Created 6 years ago

637 stars

Top 51.4% on SourcePulse

Project Summary

This repository serves as a comprehensive, collaborative catalog of Natural Language Processing (NLP) resources for Indic languages. It aims to consolidate datasets, models, libraries, and evaluation benchmarks, benefiting researchers, developers, and anyone working on NLP for the Indian subcontinent.

How It Works

The catalog is structured by NLP task and resource type, providing links and descriptions for each entry. Contributions are encouraged via pull requests or issues, following a specified format to ensure consistency. It highlights significant advancements and emerging trends in Indic language NLP, such as the rise of large-scale corpora and models supporting a wide range of languages, including low-resource ones.
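The idea of organizing entries by resource type can be sketched in a few lines of Python. This is only an illustration, not the repository's actual format: the `CatalogEntry` class, its field names, and the `group_by_type` helper are all hypothetical, and the sample entries reuse only the resource names and descriptions mentioned in this summary.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntry:
    """Hypothetical record for one catalog entry (not the repo's real schema)."""
    name: str
    resource_type: str  # assumed labels, e.g. "corpus" or "benchmark"
    description: str


# Sample entries built only from resources named in this summary.
ENTRIES = [
    CatalogEntry("IndicCorp", "corpus", "large-scale corpus (9 billion tokens)"),
    CatalogEntry("Samanantar", "corpus", "corpus of 50 million sentence pairs"),
    CatalogEntry("IndicGLUE", "benchmark", "evaluation benchmark"),
    CatalogEntry("GLUECoS", "benchmark", "evaluation benchmark for code-mixed data"),
]


def group_by_type(entries):
    """Group entry names under their resource type, preserving input order."""
    grouped = defaultdict(list)
    for entry in entries:
        grouped[entry.resource_type].append(entry.name)
    return dict(grouped)


if __name__ == "__main__":
    for resource_type, names in group_by_type(ENTRIES).items():
        print(f"{resource_type}: {', '.join(names)}")
```

The same grouping could be keyed on NLP task instead of resource type; the catalog itself is a set of links and descriptions, so a structure like this would only be useful for someone indexing or filtering those links locally.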

Quick Start & Requirements

This is a catalog, not a runnable software package. Accessing the resources listed will require individual setup based on each resource's specific requirements.

Highlighted Details

Features extensive coverage of datasets and models for over 20 Indic languages, including low-resource languages like Bodo and Khasi.
Includes major initiatives like the Universal Language Contribution API (ULCA) and large-scale corpora such as IndicCorp (9 billion tokens) and Samanantar (50 million sentence pairs).
Lists numerous libraries and tools for Indic NLP tasks, including tokenization, transliteration, and NER.
Provides links to evaluation benchmarks like AI4Bharat IndicGLUE and GLUECoS for code-mixed data.

Maintenance & Community

The project is a community effort, with contributions from various institutions and individuals, including AI4Bharat, BUET CSE NLP, and IIT Patna. Users can engage through GitHub issues and pull requests.

Licensing & Compatibility

The repository itself is open-source, but the licensing of individual resources listed within the catalog varies. Users must consult the specific licenses of each dataset, model, or tool they intend to use.

Limitations & Caveats

Many resources are still tracked only as open issues and have not yet been added to the catalog, so it remains a work in progress. The usability and quality of individual resources depend on their respective creators and are not managed by this repository.
