Vietnamese NLP toolkit for text processing tasks
Underthesea is an open-source Python toolkit for Vietnamese Natural Language Processing, offering a comprehensive suite of pre-trained models and datasets for tasks like word segmentation, POS tagging, NER, and text classification. It aims to simplify NLP research and development for Vietnamese text, providing an easy-to-use API for quick integration of advanced NLP capabilities.
How It Works
Underthesea combines traditional NLP techniques with deep learning models. For tasks like word segmentation and POS tagging, it offers flexible output formats and the ability to incorporate custom fixed words. More advanced features like dependency parsing, NER, and text classification can use deep learning models, which require an additional installation (underthesea[deep]). The toolkit also integrates with external libraries for language detection (FastText) and text-to-speech (vietTTS).
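The segmentation and tagging features described above can be sketched as follows. This is a minimal illustration, not the project's reference usage: the wrapper name analyze and its return keys are hypothetical, and it assumes the word_tokenize and pos_tag functions with the format and fixed_words parameters as documented in the project README.

```python
from typing import Dict, List, Optional


def analyze(sentence: str, fixed_words: Optional[List[str]] = None) -> Dict[str, object]:
    """Segment and POS-tag one Vietnamese sentence (hypothetical helper)."""
    # Imported inside the function so this file still loads when
    # underthesea is not installed.
    from underthesea import pos_tag, word_tokenize

    # Only pass fixed_words when the caller supplied some.
    kwargs = {"fixed_words": fixed_words} if fixed_words else {}

    return {
        # Default output: a list of word strings.
        "words": word_tokenize(sentence, **kwargs),
        # format="text": one string, multi-syllable words joined by underscores.
        "text": word_tokenize(sentence, format="text"),
        # A list of (word, tag) pairs.
        "pos": pos_tag(sentence),
    }
```

With the core package installed, analyze("Chàng trai 9X Quảng Trị khởi nghiệp từ nấm sợi") returns all three views of the same sentence. Passing fixed_words=["Quảng Trị"] asks the segmenter to keep that phrase as a single token.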
Quick Start & Requirements
pip install underthesea
pip install underthesea[deep]
pip install underthesea[prompt] and set OPENAI_API_KEY.
pip install underthesea[langdetect]
pip install underthesea[wow] and underthesea download-model VIET_TTS_V0_4_1
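Before calling the prompt-based features, it can help to confirm that the package is importable and the API key is set. The check below is a sketch using only the standard library. The function name preflight and its result keys are hypothetical and not part of Underthesea.

```python
import importlib.util
import os
from typing import Dict, Mapping, Optional


def preflight(env: Optional[Mapping[str, str]] = None) -> Dict[str, bool]:
    """Report whether underthesea is importable and OPENAI_API_KEY is set."""
    # Default to the real process environment; a plain dict can be
    # passed instead, which makes the check easy to test.
    env = os.environ if env is None else env
    return {
        # find_spec returns None when the top-level package is missing.
        "underthesea_installed": importlib.util.find_spec("underthesea") is not None,
        # An empty string counts as "not set".
        "openai_key_set": bool(env.get("OPENAI_API_KEY")),
    }


if __name__ == "__main__":
    for name, ok in preflight().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```

The check covers only the core package and the environment variable. It does not verify that any of the optional extras are installed.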
Highlighted Details
Maintenance & Community
The project is actively maintained and encourages community contributions. Further details on contributing can be found in CONTRIBUTING.rst.
Licensing & Compatibility
Licensed under the GNU General Public License v3.0 (GPL-3.0). This strong copyleft license requires that derivative works, and larger works that incorporate this code, be released under GPL-3.0 when they are distributed. Commercial use is permitted, but the copyleft terms can rule out integration into closed-source projects.
Limitations & Caveats
The GPL-3.0 license imposes significant obligations regarding the distribution of derivative works. Some advanced features, like dependency parsing and deep NER, require separate installation of deep learning dependencies.