pycorrector by shibing624

Toolkit for text error correction, supports multiple models for Chinese

created 7 years ago
6,102 stars

Top 8.6% on sourcepulse

Project Summary

This toolkit corrects errors in Chinese text and is aimed at developers and researchers working with Chinese NLP. It offers a unified platform for evaluating and applying models that correct spelling, phonetic, and grammatical errors.

How It Works

The project implements a diverse range of models for text correction, including statistical methods like KenLM (n-gram language models) and deep learning approaches such as Seq2Seq, T5, BERT variants (MacBERT, ERNIE), and large language models (ChatGLM3, Qwen2.5). This multi-model strategy allows for comparison and selection of the best-performing approach based on specific error types and performance requirements.
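The statistical route can be illustrated with a toy example. The sketch below is not pycorrector's implementation and does not use KenLM: it assumes a tiny hand-made corpus, a character-bigram model with add-one smoothing, and a hypothetical `CANDIDATES` map of easily confused characters. It tries each single-character substitution and keeps the one the language model scores highest.

```python
import math
from collections import Counter

# Tiny hand-made training corpus (an assumption for illustration only).
CORPUS = ["我应该去学校", "你应该让座", "他应该回家", "因为下雨了"]

# Hypothetical confusion candidates: character -> plausible replacements.
CANDIDATES = {"因": ["应"], "坐": ["座"]}


def train_bigrams(corpus):
    """Count character unigrams (as contexts) and bigrams with sentence markers."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        chars = ["<s>"] + list(sent) + ["</s>"]
        unigrams.update(chars[:-1])
        bigrams.update(zip(chars, chars[1:]))
    vocab = {c for s in corpus for c in s} | {"<s>", "</s>"}
    return unigrams, bigrams, len(vocab)


def log_prob(sentence, model):
    """Log-probability of a sentence under the add-one-smoothed bigram model."""
    unigrams, bigrams, vocab_size = model
    chars = ["<s>"] + list(sentence) + ["</s>"]
    return sum(
        math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab_size))
        for a, b in zip(chars, chars[1:])
    )


def correct(sentence, model):
    """Return the best-scoring single-character substitution, or the input itself."""
    best, best_score = sentence, log_prob(sentence, model)
    for i, ch in enumerate(sentence):
        for alt in CANDIDATES.get(ch, []):
            candidate = sentence[:i] + alt + sentence[i + 1:]
            score = log_prob(candidate, model)
            if score > best_score:
                best, best_score = candidate, score
    return best


MODEL = train_bigrams(CORPUS)
print(correct("我因该去学校", MODEL))  # 我应该去学校
```

A real n-gram corrector works with far larger models and handles multiple errors per sentence; the point here is only the "generate candidates, score with a language model, keep the best" loop.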

Quick Start & Requirements

  • Install via pip: pip install -U pycorrector
  • Dependencies: Python 3.8+, PyTorch, PaddlePaddle (for ERNIE), ModelScope (for MuCGECBart). GPU is recommended for deep learning models.
  • Official docs: see the "文档/Docs" link in the project README
  • Demo: hosted on HuggingFace
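After installation, a first call might look like the sketch below. It follows the usage pattern shown in the project's README at the time of writing (the default `Corrector` class and its `correct_batch` method); treat the class name, the first-run model download, and the result format as assumptions to verify against the installed version. `run_correction` is a hypothetical wrapper added here so that the import happens only when no corrector is supplied.

```python
def run_correction(sentences, corrector=None):
    """Correct a list of sentences and return (source, target) pairs."""
    if corrector is None:
        # Imported lazily: the default model is large and is fetched on first use.
        from pycorrector import Corrector
        corrector = Corrector()
    results = corrector.correct_batch(sentences)
    # Each result is assumed to be a dict with 'source', 'target' and 'errors' keys.
    return [(r["source"], r["target"]) for r in results]
```

Calling `run_correction(["少先队员因该为老人让坐"])` would load the default model and return the original sentence paired with its corrected form. Passing a different object with a `correct_batch` method lets the same wrapper be reused with another model.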

Highlighted Details

  • Supports multiple models: KenLM, DeepContext, ConvSeq2Seq, T5, ERNIE_CSC, MacBERT, MuCGECBart, NaSGECBart, ChatGLM3, LLaMA, Qwen2.5.
  • Includes evaluation scripts and benchmarks on datasets like SIGHAN-2015, EC-LAW, and MCSC.
  • Offers customizability for language models, confusion sets, and proper names.
  • Provides command-line interface for batch processing.
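A custom confusion set, as mentioned in the list above, maps a known wrong string to its intended replacement. The sketch below is a standalone illustration of that idea rather than the library's own class: the `CONFUSION` entries and the `apply_confusion` helper are hypothetical, and the `(wrong, right, index)` tuples are just one plausible way to report what was changed.

```python
# Hypothetical confusion set: wrong string -> intended replacement.
CONFUSION = {"因该": "应该", "让坐": "让座"}


def apply_confusion(text, confusion):
    """Replace every confusion-set entry found in text and report each change."""
    # Try longer keys first so overlapping entries resolve predictably;
    # empty keys are skipped because they would match everywhere.
    keys = sorted((k for k in confusion if k), key=len, reverse=True)
    out, errors = [], []
    i = 0
    while i < len(text):
        for wrong in keys:
            if text.startswith(wrong, i):
                # Record the position in the original text.
                errors.append((wrong, confusion[wrong], i))
                out.append(confusion[wrong])
                i += len(wrong)
                break
        else:
            # No entry matched at this position: copy the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out), errors


fixed, changes = apply_confusion("少先队员因该为老人让坐", CONFUSION)
print(fixed)    # 少先队员应该为老人让座
print(changes)  # [('因该', '应该', 4), ('让坐', '让座', 9)]
```

A lookup like this is deterministic and cheap, which is why confusion sets are a common way to pin down domain-specific terms or proper names that a general model keeps getting wrong.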

Maintenance & Community

  • Active development with recent updates including Qwen2.5 models.
  • Community engagement via GitHub Issues and Discussions.
  • Contact: xuming624@qq.com, WeChat: xuming624.

Licensing & Compatibility

  • Licensed under Apache License 2.0, permitting free commercial use.
  • Requires attribution with a link to the project and license.

Limitations & Caveats

  • The KenLM language model file is about 2.8 GB, which can strain systems with limited memory.
  • MuCGECBart was tested specifically on Python 3.8.19; other Python or dependency versions might cause issues.

Health Check

  • Last commit: 3 weeks ago
  • Responsiveness: Inactive
  • Pull requests (30d): 1
  • Issues (30d): 3
  • Star history: 167 stars in the last 90 days
