Automatic corpus generation for Chinese spelling correction
This repository provides scripts that automatically generate Chinese sentences with marked spelling errors, aimed at researchers and developers working on Chinese Spelling Check (CSC). Its hybrid approach to corpus generation lets users build custom datasets for training and evaluating CSC models, and the repository also ships a pre-generated dataset of over 270,000 sentences.
How It Works
The project combines OCR-based and ASR-based techniques to introduce realistic spelling errors into Chinese sentences: the OCR side yields visually similar substitutions, and the ASR side yields phonologically similar ones. Because each error is introduced programmatically, its location and correction are known without manual annotation, which makes the output directly usable for CSC research. A byproduct of the method is a confusion set that lists visually or phonologically similar variants for each character.
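To make the idea of automatically annotated errors concrete, here is a minimal sketch of how a confusion set could be used to corrupt a sentence while recording where each error sits and what the correct character is. This is not the repository's OCR/ASR pipeline; it is a simplified substitution scheme, and the names `CONFUSION_SET` and `inject_errors`, the `error_rate` parameter, and the toy entries are all assumptions made for illustration.

```python
import random

# Hypothetical toy confusion set: each character maps to visually or
# phonologically similar characters. The entries are illustrative only
# and are not taken from the repository's confusion set.
CONFUSION_SET = {
    "的": ["地", "得"],
    "在": ["再"],
    "他": ["她", "它"],
}


def inject_errors(sentence, confusion_set, error_rate=0.3, rng=None):
    """Return (noisy_sentence, annotations).

    Each annotation is a tuple (position, wrong_char, correct_char),
    where position is a 0-based character index into the sentence.
    """
    rng = rng or random.Random()
    chars = list(sentence)
    annotations = []
    for i, ch in enumerate(chars):
        candidates = confusion_set.get(ch)
        # Only characters with known confusable variants can be corrupted.
        if candidates and rng.random() < error_rate:
            wrong = rng.choice(candidates)
            chars[i] = wrong
            annotations.append((i, wrong, ch))
    return "".join(chars), annotations


# Example: corrupt every eligible character (error_rate=1.0).
noisy, marks = inject_errors(
    "他在家的时候", CONFUSION_SET, error_rate=1.0, rng=random.Random(0)
)
print(noisy, marks)
```

Since every substitution is logged as it is made, the annotations double as training labels: replaying them in reverse restores the original sentence.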
Quick Start & Requirements
python main_train.py for training the provided BiLSTM model.
python main_test.py for testing.
Highlighted Details
Maintenance & Community
The project was released in 2018 to accompany an EMNLP paper. The author can be reached by email (wangdimmy@gmail.com). The README states that the dataset and confusion set are continuously updated.
Licensing & Compatibility
The repository does not explicitly state a license. The provided datasets are for research purposes. Compatibility with commercial use is not specified.
Limitations & Caveats
The provided BiLSTM model is a basic implementation and may require further optimization. The code depends on an old library version (PyTorch 0.4), which may not run unmodified in newer environments. Although the README says the datasets are continuously updated, it gives no update schedule or mechanism.