underthesea  by undertheseanlp

Vietnamese NLP toolkit for text processing tasks

created 8 years ago
1,563 stars

Top 27.3% on sourcepulse

Project Summary

Underthesea is an open-source Python toolkit for Vietnamese natural language processing. It bundles pre-trained models and datasets for tasks such as word segmentation, part-of-speech (POS) tagging, named entity recognition (NER), and text classification. Its goal is to simplify NLP research and development on Vietnamese text through a small API that can be integrated quickly.

How It Works

Underthesea combines traditional NLP techniques with deep learning models. Word segmentation and POS tagging support more than one output format, and word segmentation accepts a list of custom fixed words that should be kept together as single tokens. Dependency parsing, NER, and text classification can use deep learning models, which require the optional underthesea[deep] installation. The toolkit also integrates with external libraries for language detection (FastText) and text-to-speech (vietTTS).
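The word-segmentation options described above can be sketched as follows. This is a minimal illustration, not the project's own code: the `segment` helper and its `tokenizer` parameter are hypothetical names introduced here, and the `format` and `fixed_words` keyword names are assumed from the project's README, so they should be checked against the installed version.

```python
from typing import Any, Callable, List, Optional


def segment(
    text: str,
    fixed_words: Optional[List[str]] = None,
    as_text: bool = False,
    tokenizer: Optional[Callable[..., Any]] = None,
) -> Any:
    """Segment Vietnamese text into words (hypothetical wrapper).

    fixed_words: phrases that should stay together as single tokens.
    as_text:     ask for one string instead of a list of tokens.
    tokenizer:   callable to use; defaults to underthesea's word_tokenize.
    """
    if tokenizer is None:
        # Imported lazily so the helper can be loaded without the package.
        # Requires: pip install underthesea
        from underthesea import word_tokenize

        tokenizer = word_tokenize

    # Only pass the options that were actually requested.
    kwargs = {}
    if fixed_words:
        kwargs["fixed_words"] = fixed_words  # assumed keyword name
    if as_text:
        kwargs["format"] = "text"  # assumed keyword name and value
    return tokenizer(text, **kwargs)


# Example call (needs underthesea installed):
# segment("Chàng trai 9X Quảng Trị khởi nghiệp từ nấm sò", as_text=True)
```

Passing the tokenizer as a parameter keeps the wrapper easy to exercise with a stub when the package or its models are not available.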

Quick Start & Requirements

  • Primary install: pip install underthesea
  • Deep learning features: pip install underthesea[deep]
  • Prompt-based text classification: pip install underthesea[prompt] and set OPENAI_API_KEY.
  • Language detection: pip install underthesea[langdetect]
  • Text-to-Speech: pip install underthesea[wow], then run underthesea download-model VIET_TTS_V0_4_1
  • Official Docs: https://github.com/undertheseanlp/underthesea
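Before using the optional features listed above, it can help to confirm that the package is importable and that the API key for prompt-based classification is set. The sketch below uses only the Python standard library; `check_setup` is a hypothetical helper name, and the check reports on the environment without importing underthesea or contacting any service.

```python
import importlib.util
import os
from typing import Dict, Mapping, Optional


def check_setup(env: Optional[Mapping[str, str]] = None) -> Dict[str, bool]:
    """Report whether the basics for underthesea are in place (hypothetical helper).

    env: mapping to read OPENAI_API_KEY from; defaults to os.environ.
    """
    if env is None:
        env = os.environ
    return {
        # True if `pip install underthesea` has been done in this environment.
        "underthesea_installed": importlib.util.find_spec("underthesea") is not None,
        # Needed only for the prompt-based classifier (underthesea[prompt]).
        "openai_key_set": bool(env.get("OPENAI_API_KEY")),
    }


if __name__ == "__main__":
    for name, ok in check_setup().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```

A check like this only shows that the base package is present; whether a given extra such as underthesea[deep] is fully installed still has to be verified by using the corresponding feature.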

Highlighted Details

  • Supports a wide range of NLP tasks: sentence segmentation, text normalization, word segmentation, POS tagging, chunking, dependency parsing, NER, text classification, sentiment analysis, language detection, and text-to-speech.
  • Offers both traditional and deep learning models for various tasks, with optional installations for advanced features.
  • Includes a variety of Vietnamese NLP datasets and provides commands to list and download them.
  • Features prompt-based text classification, allowing integration with LLMs.

Maintenance & Community

The project is actively maintained and encourages community contributions. Further details on contributing can be found in CONTRIBUTING.rst.

Licensing & Compatibility

Licensed under the GNU General Public License v3.0 (GPL-3.0). This strong copyleft license requires derivative works, and larger works that incorporate this code, to be released under GPL-3.0 when they are distributed. Commercial use is permitted, but the copyleft terms can rule out integration into closed-source products.

Limitations & Caveats

The GPL-3.0 license imposes significant obligations on anyone distributing derivative works. Some advanced features, such as dependency parsing and deep-learning NER, require a separate installation of the deep learning dependencies (underthesea[deep]).

Health Check

  • Last commit: 3 months ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 50 stars in the last 90 days
