FASPell by iqiyi

Chinese spell checker for detecting/correcting substitution errors

Created 6 years ago

1,224 stars

Top 31.8% on SourcePulse

Project Summary

FASPell is a Chinese spell checker that detects and corrects substitution errors in simplified and traditional Chinese text. It targets researchers and developers working with noisy Chinese user-generated text, and its reported results were state of the art as of early 2019.

How It Works

FASPell follows a DAE-decoder paradigm. The denoising autoencoder (DAE) is a fine-tuned BERT masked language model that generates candidate corrections for each character. The decoder, called the confidence-similarity decoder (CSD), then filters those candidates by combining each candidate's confidence score from the language model with its similarity to the original character (based on visual and phonological features), and selects the best correction. The design aims at fast, adaptable, and powerful spell checking.
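The filtering step described above can be illustrated with a minimal Python sketch. This is not FASPell's actual code: the `Candidate` type, the `linear_filter` function, its coefficients, and the example scores are all hypothetical, chosen only to show how a confidence score and a similarity score might be combined to accept or reject a candidate.

```python
from typing import Callable, List, NamedTuple


class Candidate(NamedTuple):
    char: str          # character proposed by the masked language model
    confidence: float  # model probability for this character, in [0, 1]
    similarity: float  # similarity to the original character, in [0, 1]


def linear_filter(c: Candidate) -> bool:
    """Hypothetical filter: keep candidates above a straight line in the
    confidence-similarity plane. The coefficients are invented."""
    return 0.8 * c.confidence + c.similarity - 0.8 > 0


def correct_char(
    original: str,
    candidates: List[Candidate],
    keep: Callable[[Candidate], bool] = linear_filter,
) -> str:
    """Return the replacement for one position, or the original
    character when no candidate passes the filter.

    Assumes `candidates` is sorted by descending confidence.
    """
    for cand in candidates:
        if cand.char == original:
            # The model agrees with the input, so no error is flagged.
            return original
        if keep(cand):
            return cand.char
    return original


# Illustrative scores only; they are not taken from the paper or the repo.
print(correct_char("丰", [Candidate("主", 0.99, 0.6), Candidate("丰", 0.005, 1.0)]))
print(correct_char("的", [Candidate("地", 0.55, 0.1)]))
```

With these invented scores, the first call accepts the confident and similar candidate, while the second keeps the original character because the candidate falls below the line. In the real system the filter shape is tuned on training data rather than fixed by hand.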

Quick Start & Requirements

Install dependencies with: pip install -r requirements.txt (requires Python 3.6, TensorFlow >= 1.7, matplotlib, and tqdm).
Java and apted.jar are needed only when tree edit distance is used for similarity calculation.
Data preparation involves downloading and converting provided datasets to specific formats.
Official paper: https://www.aclweb.org/anthology/D19-5522
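Before starting, it may help to confirm that the listed prerequisites are in place. The sketch below is a hypothetical helper, not part of FASPell: the function name `check_environment` and the default location of `apted.jar` (the current directory) are assumptions.

```python
import shutil
import sys
from pathlib import Path
from typing import Dict


def check_environment(apted_jar: str = "apted.jar") -> Dict[str, bool]:
    """Report whether the prerequisites named in the summary are present.

    `apted_jar` is an assumed path; point it at wherever the jar is stored.
    """
    return {
        # The summary lists Python 3.6 as the required version.
        "python_3_6": sys.version_info[:2] == (3, 6),
        # Java is only needed for tree edit distance.
        "java_on_path": shutil.which("java") is not None,
        "apted_jar_present": Path(apted_jar).is_file(),
    }


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```

A missing Java or `apted.jar` entry is not a blocker if only string edit distance is used.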

Highlighted Details

Achieved 76.2% precision and 67.1% recall on character-level detection on the SIGHAN15 test set.
Utilizes character features from Kanji Database Project (visual) and Unihan Database (phonological).
Supports both string edit distance and tree edit distance for similarity computation.
Requires a multi-stage training process involving masked LM pre-training and fine-tuning.
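To make the string edit distance option concrete, here is a minimal Python sketch of a similarity score derived from Levenshtein distance. Normalizing by the longer string's length is a common choice and is assumed here; it is not confirmed to match FASPell's exact formula. The numbered-tone pinyin strings in the example are toy inputs, not the project's actual feature encoding.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming string edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Map edit distance to a score in [0, 1]; 1.0 means identical.

    Assumption: normalize by the length of the longer string.
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


# Toy pronunciation strings (numbered-tone pinyin, assumed encoding).
print(similarity("zhu3", "zhu4"))   # 0.75: one substitution over four symbols
print(similarity("feng1", "zhu3"))  # 0.0: no symbols in common
```

The same idea applies to visual features if each character's structure is encoded as a string; tree edit distance instead compares the structure as a tree, which is where Java and `apted.jar` come in.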

Maintenance & Community

The project was published in 2019; no recent activity is indicated.
No community links (Discord, Slack) are provided.

Licensing & Compatibility

Licensed under GNU General Public License v3.0 (GPLv3).
GPLv3 is a strong copyleft license. Commercial use is permitted, but distributed derivative works must also be released under GPLv3, which generally rules out integration into closed-source products.

Limitations & Caveats

The "state-of-the-art" performance is as of early 2019, and newer models may surpass it.
The setup and training process, particularly tuning the filters of the confidence-similarity decoder (CSD), is complex and time-consuming.
Requires specific data formatting and dependencies, including older versions of Python and TensorFlow.

Health Check

Last Commit: 3 years ago
Responsiveness: Inactive
Pull Requests (30d): 0
Issues (30d): 0

Star History

2 stars in the last 30 days
