awesome-hungarian-nlp  by oroszgy

NLP resource list for Hungarian

created 8 years ago
253 stars

Top 99.5% on sourcepulse

Project Summary

This repository is a curated list of open-source Natural Language Processing (NLP) resources specifically for the Hungarian language. It serves as a comprehensive catalog for researchers, developers, and students working with Hungarian text data, aiming to centralize tools, models, datasets, and learning materials.

How It Works

The list is organized by task, covering the NLP pipeline from basic text processing (tokenization, morphology) to advanced tasks such as named entity recognition, sentiment analysis, and machine translation. It flags resources that are easy to install, carry commercial-friendly licenses, or ship pre-trained models, giving a quick indication of how suitable each one is for adoption.

Quick Start & Requirements

  • Installation and usage vary by resource; many are Python packages installable via pip (e.g., pip install huntoken, pip install huspacy).
  • Some tools may require specific dependencies like Java, Clojure, or Docker.
  • Large datasets such as the Hungarian Webcorpus (over 1.48 billion words) are listed alongside the tools.
  • Links to official documentation, demos, and Hugging Face datasets are provided for individual resources.
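Because the prerequisites differ from one resource to the next, it can help to check what is already available locally before installing anything. The following is a minimal sketch of such a check, not something provided by the list itself: the function name check_prerequisites is hypothetical, the executable names (java, docker) are taken from the dependencies mentioned above, and the import name huspacy is an assumption based on the pip package name.

```python
import importlib.util
import shutil

# Illustrative prerequisite names; adjust them to the resource you plan to use.
EXECUTABLES = ["java", "docker"]
PYTHON_PACKAGES = ["huspacy"]  # assumed import name, matching the pip package


def check_prerequisites(executables, packages):
    """Return a dict mapping each prerequisite name to True if it was found."""
    found = {}
    for exe in executables:
        # shutil.which returns None when the executable is not on PATH.
        found[exe] = shutil.which(exe) is not None
    for pkg in packages:
        # find_spec returns None when a top-level module is not installed.
        found[pkg] = importlib.util.find_spec(pkg) is not None
    return found


if __name__ == "__main__":
    report = check_prerequisites(EXECUTABLES, PYTHON_PACKAGES)
    for name, ok in report.items():
        print(f"{name}: {'found' if ok else 'missing'}")
```

The check only reports whether a name resolves on this machine; it says nothing about versions, so each resource's own documentation remains the reference for exact requirements.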

Highlighted Details

  • Extensive coverage of morphological analyzers and taggers, including emMorph, hunmorph, and hunpos.
  • A wide array of transformer models and LLMs specifically trained or adapted for Hungarian, such as huBERT, PULI-GPTrio, and SambaLingo-Hungarian-Base.
  • Numerous annotated and parallel corpora, including the massive Hungarian Webcorpus (over 1.48 billion words) and the Hunglish Corpus (120 million words).
  • Sentiment analysis, named entity recognition, and syntactic parsing are also well represented.

Maintenance & Community

  • Maintained by György Orosz.
  • Links to a Slack channel (HuNLP Slack) and relevant academic groups (e.g., BME, RIL-MTA) are provided.

Licensing & Compatibility

  • Licenses vary widely across listed resources, with many noted as "Commercial-friendly" (🚀) or having permissive licenses (e.g., Open Content for Hungarian Webcorpus).
  • Users must verify individual licenses for compatibility with specific commercial or closed-source projects.

Limitations & Caveats

The list is a curated collection, and the quality, maintenance status, and ease of use can vary significantly between individual resources. Users should independently verify the suitability and current state of each tool or dataset.

Health Check

  • Last commit: 4 days ago
  • Responsiveness: 1 day
  • Pull requests (30d): 1
  • Issues (30d): 3
  • Star history: 7 stars in the last 90 days
