FASPell by iqiyi

Chinese spell checker for detecting/correcting substitution errors

Created 6 years ago
1,221 stars

Top 32.2% on SourcePulse

Project Summary

FASPell is a Chinese spell checker designed to detect and correct substitution errors in simplified and traditional Chinese text. It targets researchers and developers working with noisy Chinese user-generated text, offering state-of-the-art performance as of early 2019.

How It Works

FASPell follows a DAE-Decoder paradigm. A fine-tuned BERT masked language model acts as the denoising autoencoder (DAE) and generates candidate corrections for each character. A confidence-similarity decoder (CSD) then filters those candidates by combining the language model's confidence scores with character similarity (based on visual and phonological features), and selects the best correction. This design is intended to make spell checking fast, adaptable, and powerful.
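
The filtering step described above can be sketched in a few lines of Python. This is a minimal illustration, not FASPell's actual code: the Candidate class, the correct_char function, and the linear_filter threshold are all hypothetical names and values chosen for this example, and the single straight-line filter is a simplifying assumption standing in for whatever filters are tuned on real data.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Candidate:
    char: str          # character proposed by the masked language model
    confidence: float  # model probability for this character, in [0, 1]
    similarity: float  # visual/phonological similarity to the original, in [0, 1]


def linear_filter(candidate: Candidate) -> bool:
    """Hypothetical filter: accept points above a line in the
    confidence-similarity plane. The 1.2 threshold is made up."""
    return candidate.confidence + candidate.similarity >= 1.2


def correct_char(original: str,
                 candidates: List[Candidate],
                 keep: Callable[[Candidate], bool] = linear_filter) -> str:
    """Return the most confident candidate that passes the filter,
    or the original character if none does."""
    ranked = sorted(candidates, key=lambda c: c.confidence, reverse=True)
    for cand in ranked:
        if cand.char == original:
            # The model agrees with the input: treat it as correct.
            return original
        if keep(cand):
            return cand.char
    return original


# Illustrative scores only; they do not come from a real model.
candidates = [Candidate("著", 0.90, 0.60), Candidate("知", 0.05, 0.10)]
print(correct_char("苦", candidates))  # 著
```

The point of combining both signals is that a confident but dissimilar candidate, or a similar but improbable one, can be rejected, which keeps the checker from rewriting characters that were already correct.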

Quick Start & Requirements

  • Install via pip install -r requirements.txt (requires Python 3.6, TensorFlow >= 1.7, matplotlib, tqdm).
  • Java and apted.jar are required only if tree edit distance is used for the similarity calculation.
  • Data preparation involves downloading and converting provided datasets to specific formats.
  • Official paper: https://www.aclweb.org/anthology/D19-5522

Highlighted Details

  • Achieved 76.2% precision and 67.1% recall on character-level detection on the SIGHAN15 test set.
  • Utilizes character features from Kanji Database Project (visual) and Unihan Database (phonological).
  • Supports both string edit distance and tree edit distance for similarity computation.
  • Requires a multi-stage training process involving masked LM pre-training and fine-tuning.
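
To make the string edit distance option concrete, here is a small sketch of how an edit distance can be turned into a similarity score between two characters' feature strings. It is an assumption-laden illustration: the function names, the normalization by the longer string, and the toned-pinyin example strings are hypothetical and are not taken from the FASPell code or data files.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming string edit distance
    (insertions, deletions, and substitutions each cost 1)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Map edit distance into [0, 1]; 1.0 means identical feature strings.
    Normalizing by the longer string is an assumption of this sketch."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


# Hypothetical phonological feature strings (pinyin with tone digits).
print(similarity("zhu4", "zhu3"))  # 0.75
print(similarity("zhu4", "ku3"))
```

Tree edit distance follows the same idea but compares the structure of character decompositions rather than flat strings, which is why it needs the external apted.jar dependency.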

Maintenance & Community

  • The project was published in 2019 and appears inactive; the last commit was 3 years ago.
  • No community links (Discord, Slack) are provided.

Licensing & Compatibility

  • Licensed under GNU General Public License v3.0 (GPLv3).
  • GPLv3 is a strong copyleft license. Commercial use is permitted, but distributed derivative works must also be released under GPLv3, which generally rules out integration into closed-source products.

Limitations & Caveats

  • The "state-of-the-art" performance is as of early 2019, and newer models may surpass it.
  • The setup and training process, particularly tuning the confidence-similarity decoder (CSD) filters, is complex and time-consuming.
  • Requires specific data formatting and dated dependencies, including Python 3.6 and TensorFlow >= 1.7.

Health Check

  • Last commit: 3 years ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 2 stars in the last 30 days
