JioNLP by dongrixinyu

NLP toolkit for Chinese text preprocessing and parsing

Created 5 years ago

3,794 stars

Top 12.7% on SourcePulse

Project Summary

JioNLP is a Python toolkit for preprocessing and parsing Chinese text. It bundles functions for data cleaning, entity recognition, text manipulation, and related tasks, so that developers and researchers can handle common NLP chores without writing them from scratch. The project emphasizes accuracy, efficiency, and ease of use.

How It Works

JioNLP is organized as a modular collection of specialized functions, many of which rely on regular expressions and curated dictionaries for specific parsing and extraction tasks. Each preprocessing step is exposed as its own tool, so users can call a single function or chain several into a larger pipeline. The library also includes utilities for data augmentation and for evaluation, such as the MELLM algorithm for LLM assessment.
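The regex-plus-dictionary approach described above can be sketched with the standard library alone. This is an illustration of the general technique, not JioNLP's code: the patterns, the tiny city dictionary, and the function names (extract_emails, extract_phones, find_cities, preprocess) are all hypothetical and far less thorough than a curated toolkit would be.

```python
import re

# Illustrative patterns only; production-grade rules need much wider coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Mainland China mobile numbers: 11 digits, first digit 1, second digit 3-9.
PHONE_RE = re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)")

# Hypothetical stand-in for a curated geographical dictionary.
CITY_TO_PROVINCE = {"广州": "广东省", "成都": "四川省", "杭州": "浙江省"}


def extract_emails(text):
    """Return every substring that looks like an e-mail address."""
    return EMAIL_RE.findall(text)


def extract_phones(text):
    """Return every substring that looks like a mobile phone number."""
    return PHONE_RE.findall(text)


def find_cities(text):
    """Return (city, province) pairs for dictionary entries found in the text."""
    return [(city, prov) for city, prov in CITY_TO_PROVINCE.items() if city in text]


def preprocess(text):
    """Combine the individual tools into one small pipeline."""
    return {
        "emails": extract_emails(text),
        "phones": extract_phones(text),
        "cities": find_cities(text),
        # Mask the sensitive spans so downstream steps see cleaned text.
        "cleaned": PHONE_RE.sub("<PHONE>", EMAIL_RE.sub("<EMAIL>", text)),
    }


sample = "联系人在广州，邮箱 test@example.com，手机 13812345678。"
result = preprocess(sample)
print(result)
```

Each function can be used independently, and preprocess shows how they compose; that mirrors the "pick one tool or chain several" design the summary describes.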

Quick Start & Requirements

Install via pip: pip install jionlp
Requires Python >= 3.6.
For MELLM evaluation, download norm_score.json and max_score.json from the test data (password: jmbo).
Official documentation and demos are available via the GitHub repository.

Highlighted Details

Extensive feature set covering text cleaning, time/location/ID parsing, phonetic conversion, data augmentation (back-translation, homophone substitution), and NER assistance.
Includes specialized loaders for Chinese idioms, 歇后语 (xiēhòuyǔ, two-part allegorical sayings), and geographical dictionaries.
Offers LLM evaluation datasets and the MELLM algorithm for unsupervised LLM assessment.
Provides utilities for file I/O, logging, and timing code execution.
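To make the ID parsing item above concrete, here is a minimal sketch of how an 18-digit mainland China resident ID number can be validated and decomposed using the publicly documented checksum (weights and check characters from GB 11643-1999). It is not JioNLP's implementation; the function name parse_resident_id and the returned field names are assumptions made for this example, and the region code is returned raw rather than resolved against a location dictionary.

```python
import re
from datetime import date

# Position weights and check characters defined by GB 11643-1999.
WEIGHTS = (7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2)
CHECK_CHARS = "10X98765432"
ID_RE = re.compile(r"[0-9]{17}[0-9X]")


def parse_resident_id(id_number):
    """Return a dict of parsed fields, or None if the number is invalid.

    Hypothetical helper for illustration; not part of any library.
    """
    id_number = id_number.strip().upper()
    if not ID_RE.fullmatch(id_number):
        return None

    # Checksum: weighted sum of the first 17 digits, modulo 11.
    total = sum(int(digit) * weight for digit, weight in zip(id_number, WEIGHTS))
    if CHECK_CHARS[total % 11] != id_number[17]:
        return None

    # Characters 7-14 hold the birth date as YYYYMMDD.
    try:
        birthday = date(int(id_number[6:10]), int(id_number[10:12]), int(id_number[12:14]))
    except ValueError:
        return None

    return {
        "region_code": id_number[:6],
        "birthday": birthday,
        # The 17th digit is odd for male, even for female.
        "gender": "male" if int(id_number[16]) % 2 else "female",
    }


parsed = parse_resident_id("11010519491231002X")
print(parsed)
```

Returning None for any failed check keeps the caller's logic simple; a fuller tool would typically also map the region code to province, city, and county names.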

Maintenance & Community

The project is actively maintained, with recent updates including LLM evaluation datasets and modifications to dictionary content. Users can engage with the community via a WeChat official account ("JioNLP") for updates and group access. Suggestions and bug reports are encouraged through GitHub issues.

Licensing & Compatibility

The repository does not explicitly state a license in the provided README. Users should verify licensing terms for commercial use or integration into closed-source projects.

Limitations & Caveats

The README mentions a plan to simplify the chinese_idiom_loader by removing definitions, which might affect users relying on the full dictionary. Some MELLM evaluation components require downloading password-protected files. The absence of a clearly stated license could be a concern for some users.

Health Check

Last Commit

1 month ago

Responsiveness

1 day

Star History

16 stars in the last 30 days