indicnlp_catalog by AI4Bharat

NLP resource catalog for Indic languages

created 6 years ago
606 stars

Top 54.8% on sourcepulse

Project Summary

This repository is a collaborative catalog of Natural Language Processing (NLP) resources for Indic languages. It aims to consolidate datasets, models, libraries, and evaluation benchmarks in one place for researchers, developers, and anyone else working on NLP for the languages of the Indian subcontinent.

How It Works

The catalog is structured by NLP task and resource type, providing links and descriptions for each entry. Contributions are encouraged via pull requests or issues, following a specified format to ensure consistency. It highlights significant advancements and emerging trends in Indic language NLP, such as the rise of large-scale corpora and models supporting a wide range of languages, including low-resource ones.

Quick Start & Requirements

This is a catalog, not a runnable software package. Accessing the resources listed will require individual setup based on each resource's specific requirements.

Highlighted Details

  • Features extensive coverage of datasets and models for over 20 Indic languages, including low-resource languages like Bodo and Khasi.
  • Includes major initiatives like the Universal Language Contribution API (ULCA) and large-scale corpora such as IndicCorp (9 billion tokens) and Samanantar (50 million sentence pairs).
  • Lists numerous libraries and tools for Indic NLP tasks, including tokenization, transliteration, and NER.
  • Provides links to evaluation benchmarks such as AI4Bharat's IndicGLUE and, for code-mixed data, GLUECoS.

Maintenance & Community

The project is a community effort, with contributions from various institutions and individuals, including AI4Bharat, BUET CSE NLP, and IIT Patna. Users can engage through GitHub issues and pull requests.

Licensing & Compatibility

The repository itself is open-source, but the licensing of individual resources listed within the catalog varies. Users must consult the specific licenses of each dataset, model, or tool they intend to use.

Limitations & Caveats

Many resources are still listed only in open issues, so the catalog remains a work in progress. The usability and quality of individual resources depend on their respective creators and are not managed by this repository.

Health Check

  • Last commit: 7 months ago
  • Responsiveness: 1 week
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 16 stars in the last 90 days
