Automatic-Corpus-Generation by wdimmy

Automatic corpus generation for Chinese spelling correction

Created 7 years ago

294 stars

Top 90.2% on SourcePulse

Project Summary

This repository provides scripts that automatically generate Chinese sentences with marked spelling errors, aimed at researchers and developers working on Chinese Spelling Check (CSC). Its hybrid approach to corpus generation lets users build custom datasets for training and evaluating CSC models. A pre-generated dataset of over 270,000 sentences is also included.

How It Works

The project combines two methods to introduce realistic spelling errors into Chinese sentences: an OCR-based method, which yields visually similar character errors, and an ASR-based method, which yields phonologically similar ones. Because the errors are introduced automatically, their locations and corrections are known without manual annotation. A byproduct of this process is a confusionset that lists visually or phonologically similar characters.
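To illustrate why error locations and corrections come for free when errors are injected automatically, here is a minimal Python sketch. It is not the repository's OCR or ASR pipeline: it simply substitutes characters from a small hand-written confusion set. The names TOY_CONFUSION_SET, inject_errors, and restore are hypothetical, and the entries are illustrative rather than taken from the project's confusionset.

```python
import random

# Hypothetical toy confusion set: each character maps to characters that are
# visually or phonologically similar. The entries are illustrative only; the
# repository's confusionset and its file format are not assumed here.
TOY_CONFUSION_SET = {
    "己": ["已", "巳"],
    "在": ["再"],
    "的": ["得", "地"],
}


def inject_errors(sentence, confusion_set, rate=0.3, rng=None):
    """Replace some characters with confusable ones.

    Returns (noisy_sentence, errors), where errors is a list of
    (position, wrong_char, correct_char) tuples with 0-based positions.
    """
    rng = rng or random.Random()
    chars = list(sentence)
    errors = []
    for i, ch in enumerate(chars):
        candidates = confusion_set.get(ch)
        # Only characters with known confusable variants can be corrupted.
        if candidates and rng.random() < rate:
            wrong = rng.choice(candidates)
            chars[i] = wrong
            errors.append((i, wrong, ch))
    return "".join(chars), errors


def restore(noisy_sentence, errors):
    """Undo the recorded substitutions to recover the original sentence."""
    chars = list(noisy_sentence)
    for pos, _wrong, correct in errors:
        chars[pos] = correct
    return "".join(chars)
```

Calling inject_errors("我在家的时候", TOY_CONFUSION_SET, rate=1.0) corrupts every character that has an entry in the toy set and returns the recorded positions and corrections alongside the noisy sentence, so each generated sentence carries its own annotation.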

Quick Start & Requirements

Installation: Requires Python 3.5+, PyTorch 0.4, NumPy, BeautifulSoup, pytesseract, OpenCV, and Kaldi.
Training: Run python main_train.py to train the provided BiLSTM model.
Testing: Run python main_test.py to test it.
Datasets: Pre-generated datasets and confusionsets are available. The SIGHAN Bake-off 2013, 2014, and 2015 datasets, originally in Traditional Chinese, are provided in Simplified Chinese, converted with OpenCC.

Highlighted Details

Includes a generated dataset of 271,329 sentences with 381,962 total errors.
Provides a confusionset for character variants, useful for CSC research.
Offers scripts for both OCR-based and ASR-based error generation methods.
Includes a baseline PyTorch BiLSTM model for CSC.

Maintenance & Community

The project was released in 2018 in connection with EMNLP. The author can be contacted by email (wangdimmy@gmail.com). The README states that the dataset and confusionset are continuously updated.

Licensing & Compatibility

The repository does not explicitly state a license. The provided datasets are for research purposes. Compatibility with commercial use is not specified.

Limitations & Caveats

The provided BiLSTM model is a basic implementation and may require further optimization. The project depends on specific library versions (PyTorch 0.4), which may cause compatibility problems in newer environments. The README says the datasets are continuously updated but does not describe an update schedule or mechanism.

Health Check

Last Commit

6 years ago

Responsiveness

Inactive

Star History

1 star in the last 30 days