FASPell by iqiyi

Chinese spell checker for detecting/correcting substitution errors

created 5 years ago
1,222 stars

Top 32.9% on sourcepulse

Project Summary

FASPell is a Chinese spell checker designed to detect and correct substitution errors in simplified and traditional Chinese text. It targets researchers and developers working with noisy Chinese user-generated text, offering state-of-the-art performance as of early 2019.

How It Works

FASPell follows a DAE-Decoder paradigm. A fine-tuned BERT masked language model serves as the denoising autoencoder (DAE) and proposes candidate characters for each position in the input. A decoder then filters those candidates by combining the language model's confidence score with character similarity, computed from visual and phonological features, and selects the best correction. The design aims to be fast, adaptable, and powerful.
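The filtering step can be illustrated with a minimal sketch. This is not FASPell's code: the Candidate class, the pick_correction function, and the two fixed thresholds are hypothetical names and values chosen for illustration, and the real project tunes its filters on training data rather than using constants like these.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    char: str          # character proposed by the masked language model
    confidence: float  # model probability for this character, in [0, 1]
    similarity: float  # visual/phonological similarity to the original, in [0, 1]


def passes_filter(c: Candidate, min_conf: float = 0.3, min_sim: float = 0.4) -> bool:
    # Hypothetical filter: a candidate must be both likely and similar enough.
    return c.confidence >= min_conf and c.similarity >= min_sim


def pick_correction(original: str, candidates: List[Candidate]) -> str:
    # Walk candidates from most to least confident.
    for c in sorted(candidates, key=lambda x: x.confidence, reverse=True):
        if c.char == original:
            # The model agrees with the input: treat the position as correct.
            return original
        if passes_filter(c):
            return c.char
    # No candidate survived the filter, so keep the original character.
    return original


if __name__ == "__main__":
    cands = [Candidate("已", 0.6, 0.8), Candidate("己", 0.3, 1.0)]
    print(pick_correction("己", cands))
```

Requiring both a confidence floor and a similarity floor reflects the idea that a substitution error usually looks or sounds like the intended character, so a confident but dissimilar candidate is more likely a false alarm.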

Quick Start & Requirements

  • Install dependencies with pip install -r requirements.txt. The project requires Python 3.6, TensorFlow >= 1.7, matplotlib, and tqdm.
  • Java and apted.jar are required for tree edit distance similarity calculation.
  • Data preparation involves downloading and converting provided datasets to specific formats.
  • Official paper: https://www.aclweb.org/anthology/D19-5522

Highlighted Details

  • Achieved 76.2% precision and 67.1% recall on character-level detection on the SIGHAN15 test set.
  • Utilizes character features from Kanji Database Project (visual) and Unihan Database (phonological).
  • Supports both string edit distance and tree edit distance for similarity computation.
  • Requires a multi-stage training process involving masked LM pre-training and fine-tuning.
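As a rough illustration of string edit distance used as a similarity measure, the sketch below computes Levenshtein distance and normalizes it by the longer string's length. The function names, the normalization, and the pinyin example strings are assumptions made for illustration; FASPell's own feature encoding and scoring may differ.

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance, keeping one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb, or match
            ))
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    # Assumed normalization: 1.0 means identical, 0.0 means nothing in common.
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


if __name__ == "__main__":
    # Hypothetical phonological features written as pinyin strings.
    print(similarity("zhang", "zhan"))
```

The same function can be applied to any string encoding of a character, for example a sequence describing its visual components, which is why one distance routine can serve both visual and phonological comparisons.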

Maintenance & Community

  • The project was published in 2019; no recent activity is indicated.
  • No community links (Discord, Slack) are provided.

Licensing & Compatibility

  • Licensed under GNU General Public License v3.0 (GPLv3).
  • GPLv3 is a strong copyleft license. Commercial use is permitted, but distributing a derivative work, including one that integrates FASPell into a closed-source product, requires releasing that work's source under GPLv3.

Limitations & Caveats

  • The "state-of-the-art" performance is as of early 2019, and newer models may surpass it.
  • The setup and training process, particularly tuning the CSD filters, is complex and time-consuming.
  • Requires specific data formatting and dependencies, including older versions of Python and TensorFlow.

Health Check

  • Last commit: 2 years ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0

Star History

  • 8 stars in the last 90 days
