pycorrector by shibing624

Toolkit for Chinese text error correction, with support for multiple models

Created 7 years ago

6,331 stars

Top 8.1% on SourcePulse

2 Experts Love This Project

chiphuyen

Author of "AI Engineering", "Designing Machine Learning Systems"

shizhediao

Author of LMFlow; Research Scientist at NVIDIA

Project Summary

This toolkit targets developers and researchers working on Chinese NLP. It offers a unified platform for evaluating and applying models that correct spelling, phonetic, and grammatical errors in Chinese text.

How It Works

The project implements a diverse range of models for text correction, including statistical methods like KenLM (n-gram language models) and deep learning approaches such as Seq2Seq, T5, BERT variants (MacBERT, ERNIE), and large language models (ChatGLM3, Qwen2.5). This multi-model strategy allows for comparison and selection of the best-performing approach based on specific error types and performance requirements.
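As an illustration of how correctors could be compared, the sketch below scores any "sentence in, sentence out" function on labelled (source, target) pairs. It uses one common sentence-level convention for precision, recall, and F1, and is not necessarily the metric the project's own evaluation scripts use. The function name `sentence_level_scores`, the toy correctors, and the sample pairs are all hypothetical.

```python
from typing import Callable, Iterable, Tuple


def sentence_level_scores(
    correct_fn: Callable[[str], str],
    pairs: Iterable[Tuple[str, str]],
) -> Tuple[float, float, float]:
    """Return (precision, recall, f1) at sentence level.

    A pair is "positive" when the source differs from the target,
    i.e. the sentence really contains an error.
    """
    tp = fp = fn = 0
    for source, target in pairs:
        predicted = correct_fn(source)
        if source != target:
            if predicted == target:
                tp += 1  # erroneous sentence fully corrected
            else:
                fn += 1  # erroneous sentence missed or wrongly corrected
        elif predicted != target:
            fp += 1  # clean sentence was changed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy data and toy "models", for illustration only.
PAIRS = [
    ("今天新情很好", "今天心情很好"),
    ("少先队员因该让座", "少先队员应该让座"),
    ("天气很好", "天气很好"),
]


def identity_corrector(text: str) -> str:
    return text  # never changes anything


def toy_corrector(text: str) -> str:
    return text.replace("新情", "心情")  # knows exactly one fix
```

Running both toy correctors through `sentence_level_scores` on the same pairs gives numbers that can be compared side by side, which is the same idea as benchmarking several real models on one dataset.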

Quick Start & Requirements

Install via pip: pip install -U pycorrector
Dependencies: Python 3.8+, PyTorch, PaddlePaddle (for ERNIE), ModelScope (for MuCGECBart). GPU is recommended for deep learning models.
Official Docs: 📖 Docs
Demo: hosted on HuggingFace
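After installation, a first call might look like the sketch below. It assumes the package exposes a `Corrector` class with a `correct_batch` method that returns dictionaries with `source`, `target`, and `errors` keys, where each error is a `(wrong, right, position)` tuple. Check these names against the official docs for your installed version. The helpers `describe` and `run_default_corrector` are hypothetical and exist only for this example.

```python
from typing import Dict, List


def describe(result: Dict) -> List[str]:
    """Format one correction result as readable lines.

    Assumes the result has 'source', 'target', and 'errors' keys,
    with errors given as (wrong, right, position) tuples.
    """
    lines = [f"{result['source']} -> {result['target']}"]
    for wrong, right, position in result.get("errors", []):
        lines.append(f"  {wrong} -> {right} at index {position}")
    return lines


def run_default_corrector(sentences: List[str]) -> List[Dict]:
    """Correct a batch of sentences with the default corrector.

    The import is done lazily so this file still loads when
    pycorrector is not installed.
    """
    # Assumption: Corrector is the default KenLM-based corrector.
    from pycorrector import Corrector

    model = Corrector()  # may download a large language model on first use
    return model.correct_batch(sentences)
```

With the package installed, `for r in run_default_corrector(["少先队员因该为老人让坐"]): print("\n".join(describe(r)))` would print each sentence with its detected errors.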

Highlighted Details

Supports multiple models: KenLM, DeepContext, ConvSeq2Seq, T5, ERNIE_CSC, MacBERT, MuCGECBart, NaSGECBart, ChatGLM3, LLaMA, Qwen2.5.
Includes evaluation scripts and benchmarks on datasets like SIGHAN-2015, EC-LAW, and MCSC.
Offers customizability for language models, confusion sets, and proper names.
Provides command-line interface for batch processing.
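To show what a confusion set does, here is a minimal pure-Python sketch: a mapping from known wrong spans to their corrections, applied left to right. It illustrates the idea only and is not the project's implementation. The function name `correct_with_confusion` and the sample entries are hypothetical.

```python
from typing import Dict, List, Tuple


def correct_with_confusion(
    text: str, confusion: Dict[str, str]
) -> Tuple[str, List[Tuple[str, str, int]]]:
    """Replace every known wrong span with its correction.

    Returns the corrected text and a list of (wrong, right, index)
    tuples, where index is the position in the original text.
    """
    # Try longer entries first so multi-character spans win over shorter ones.
    keys = sorted((k for k in confusion if k), key=len, reverse=True)
    errors: List[Tuple[str, str, int]] = []
    pieces: List[str] = []
    i = 0
    while i < len(text):
        for wrong in keys:
            if text.startswith(wrong, i):
                right = confusion[wrong]
                errors.append((wrong, right, i))
                pieces.append(right)
                i += len(wrong)
                break
        else:
            # No entry matches here; keep the character unchanged.
            pieces.append(text[i])
            i += 1
    return "".join(pieces), errors


# Hypothetical custom confusion set.
CONFUSION = {"因该": "应该", "坐": "座"}
fixed, found = correct_with_confusion("少先队员因该为老人让坐", CONFUSION)
```

A lookup like this is fast and fully predictable, which is why custom confusion sets suit domain-specific terms and proper names that a general model tends to miscorrect.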

Maintenance & Community

Active development with recent updates including Qwen2.5 models.
Community engagement via GitHub Issues and Discussions.
Contact: xuming624@qq.com, WeChat: xuming624.

Licensing & Compatibility

Licensed under Apache License 2.0, permitting free commercial use.
Requires attribution with a link to the project and license.

Limitations & Caveats

The KenLM model (2.8 GB) can be resource-intensive on systems with limited memory.
The MuCGECBart model was tested specifically on Python 3.8.19; other Python or dependency versions may cause issues.

Health Check

Last Commit: 5 days ago
Responsiveness: 1 day
Pull Requests (30d): 0
Issues (30d): 6

Star History

37 stars in the last 30 days

