Automatic-Corpus-Generation by wdimmy

Automatic corpus generation for Chinese spelling correction

Created 7 years ago
295 stars

Top 89.8% on SourcePulse

Project Summary

This repository provides scripts that automatically generate Chinese sentences with marked spelling errors, aimed at researchers and developers working on Chinese Spelling Checking (CSC). Its hybrid approach to corpus generation lets users build custom datasets for training and evaluating CSC models, and a pre-generated dataset of over 270,000 sentences is included.

How It Works

The project combines OCR-based and ASR-based techniques to introduce realistic spelling errors into Chinese sentences: OCR misrecognitions yield visually similar substitutions, while ASR misrecognitions yield phonologically similar ones. Because each error comes from comparing recognizer output against known text, error locations and corrections are produced automatically, without manual annotation. A byproduct of this method is a confusionset that lists visually or phonologically similar characters.
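
The comparison step described above can be illustrated with a minimal sketch. It is not the repository's code: the function names mark_errors and build_confusion_set are hypothetical, and the "recognized" strings are hard-coded stand-ins for what an OCR or ASR system might return. The sketch also assumes the recognizer output has the same length as the original sentence, so characters can be compared position by position.

```python
from collections import defaultdict


def mark_errors(original, recognized):
    """Compare two equal-length sentences character by character.

    Returns a list of (position, wrong_char, correct_char) tuples,
    with 1-based positions. Hypothetical helper, not from the repository.
    """
    if len(original) != len(recognized):
        raise ValueError("sentences must have the same length to be aligned")
    return [
        (i + 1, rec, orig)
        for i, (orig, rec) in enumerate(zip(original, recognized))
        if orig != rec
    ]


def build_confusion_set(pairs):
    """Collect, for each correct character, the wrong characters seen in its place."""
    confusion = defaultdict(set)
    for original, recognized in pairs:
        for _, wrong, correct in mark_errors(original, recognized):
            confusion[correct].add(wrong)
    return dict(confusion)


# Hard-coded, made-up recognizer outputs (assumptions for illustration only).
PAIRS = [
    ("我今天很高兴", "我今夭很高兴"),  # visually similar: 天 read as 夭
    ("他在学校上课", "他再学校上课"),  # same pronunciation: 在 heard as 再
]

if __name__ == "__main__":
    for original, recognized in PAIRS:
        print(recognized, mark_errors(original, recognized))
    print(build_confusion_set(PAIRS))
```

In this sketch the sentence with errors is the recognizer output, and the annotation (location plus correction) falls out of the comparison, which is why no manual labeling is needed.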

Quick Start & Requirements

  • Installation: Requires Python 3.5+, PyTorch 0.4, NumPy, BeautifulSoup, pytesseract, OpenCV, and Kaldi.
  • Training: Run python main_train.py to train the provided BiLSTM model.
  • Testing: Run python main_test.py to test it.
  • Datasets: Pre-generated datasets and confusionsets are available. The SIGHAN Bake-off 2013, 2014, and 2015 datasets, originally in Traditional Chinese, are provided in Simplified Chinese, converted with OpenCC.
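
Before running the training or testing scripts, it can help to confirm that the Python dependencies listed above are importable. The following is a small sketch, not part of the repository: check_dependencies is a hypothetical helper, and the import names assume the usual packages (torch for PyTorch, bs4 for BeautifulSoup 4, cv2 for OpenCV). Kaldi is a separate toolkit rather than an importable Python module, so it is not covered here.

```python
import importlib.util

# Display name -> assumed import name for each Python dependency.
REQUIRED_MODULES = {
    "PyTorch": "torch",
    "NumPy": "numpy",
    "BeautifulSoup": "bs4",
    "pytesseract": "pytesseract",
    "OpenCV": "cv2",
}


def check_dependencies(modules=REQUIRED_MODULES):
    """Return {display name: True/False} depending on whether the module is found.

    find_spec only locates a top-level module; it does not import it.
    """
    return {
        name: importlib.util.find_spec(module) is not None
        for name, module in modules.items()
    }


if __name__ == "__main__":
    for name, found in check_dependencies().items():
        print("{}: {}".format(name, "found" if found else "missing"))
```

This only reports whether a module can be located. It does not check versions, so a found PyTorch may still differ from the 0.4 release the project targets.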

Highlighted Details

  • Includes a generated dataset of 271,329 sentences with 381,962 total errors.
  • Provides a confusionset for character variants, useful for CSC research.
  • Offers scripts for both OCR-based and ASR-based error generation methods.
  • Includes a baseline PyTorch BiLSTM model for CSC.

Maintenance & Community

The project was released in 2018 to accompany an EMNLP paper. The author can be reached by email (wangdimmy@gmail.com). The README states that the dataset and confusionset are continuously updated.

Licensing & Compatibility

The repository does not explicitly state a license. The provided datasets are for research purposes. Compatibility with commercial use is not specified.

Limitations & Caveats

The provided BiLSTM model is a basic baseline and may require further optimization. The project depends on old library versions (notably PyTorch 0.4), which may not install or run cleanly in newer environments. The README says the datasets are continuously updated, but it gives no update schedule or mechanism.

Health Check

  • Last Commit: 6 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 0 stars in the last 30 days

Explore Similar Projects

Starred by Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral) and Luis Capelo (Cofounder of Lightning AI).

Gramformer by PrithivirajDamodaran

0.2%
2k
Grammar correction framework for NLP text
Created 4 years ago
Updated 2 years ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA).

pycorrector by shibing624

0.2%
6k
Toolkit for text error correction, supports multiple models for Chinese
Created 7 years ago
Updated 1 week ago