NLP dataset collection for research
This repository provides a curated collection of Chinese NLP datasets, primarily focused on Named Entity Recognition (NER), Reading Comprehension (RC), and Text Matching tasks. It aims to offer researchers and practitioners readily usable, cleaned, and standardized datasets for training and evaluating NLP models, with a particular emphasis on simplifying the data preparation process for Chinese language tasks.
How It Works
The project consolidates data from various online sources and competitions, performing simple rule-based cleaning and format standardization. Datasets are converted to common formats, such as BIO tagging for NER, and links to original sources or descriptions are provided. The primary benefit is the aggregation and preprocessing of diverse Chinese NLP datasets into a more accessible format, saving users significant time and effort in data collection and cleaning.
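As an illustration of the BIO format mentioned above, here is a minimal sketch of character-level tagging, which is the usual granularity for Chinese NER. The `(start, end, label)` span layout, the function name `spans_to_bio`, and the sample sentence are assumptions made for this example; the repository's own conversion scripts and file layout may differ.

```python
from typing import List, Tuple

# Hypothetical span format: (start, end_exclusive, label), in character offsets.
Span = Tuple[int, int, str]


def spans_to_bio(text: str, spans: List[Span]) -> List[str]:
    """Return one BIO tag per character, assuming the spans do not overlap."""
    tags = ["O"] * len(text)
    for start, end, label in spans:
        if not (0 <= start < end <= len(text)):
            raise ValueError(f"span out of range: {(start, end, label)}")
        tags[start] = f"B-{label}"          # first character of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining characters of the entity
    return tags


# Illustrative sentence, not taken from any dataset in the collection.
text = "南京理工大学在南京"
spans = [(0, 6, "ORG"), (7, 9, "LOC")]
tags = spans_to_bio(text, spans)

for char, tag in zip(text, tags):
    print(f"{char}\t{tag}")
```

Each output line pairs one character with its tag, which matches the one-token-per-line layout commonly used for BIO files.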
Quick Start & Requirements
Highlighted Details
Maintenance & Community
The project was last updated on June 16, 2022. It is maintained by "NJUST-TB". No community links or active development signals are present in the README.
Licensing & Compatibility
The datasets are explicitly stated to be for academic research purposes only and are not to be used for commercial activities.
Limitations & Caveats
The README notes that during BIO conversion for NER, long entities may overwrite short entities due to simple cleaning rules. The datasets are provided via Baidu Cloud, which may have regional access limitations.
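The overwrite caveat can be reproduced with a small sketch. It assumes, purely for illustration, that spans are written shortest first so that longer spans land last; the README does not describe the actual rule order. The helper names `write_longest_last` and `find_overlaps` and the `(start, end, label)` span layout are hypothetical.

```python
from typing import List, Tuple

# Hypothetical span format: (start, end_exclusive, label), in character offsets.
Span = Tuple[int, int, str]


def write_longest_last(text: str, spans: List[Span]) -> List[str]:
    """Write spans shortest first, so a longer span overwrites a shorter one."""
    tags = ["O"] * len(text)
    for start, end, label in sorted(spans, key=lambda s: s[1] - s[0]):
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


def find_overlaps(spans: List[Span]) -> List[Tuple[Span, Span]]:
    """Return every pair of spans that share at least one character."""
    ordered = sorted(spans, key=lambda s: (s[0], s[1]))
    overlaps = []
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            if b[0] >= a[1]:   # sorted by start, so no later span can overlap a
                break
            overlaps.append((a, b))
    return overlaps


# Illustrative example: a short LOC span sits inside a longer ORG span.
text = "南京理工大学"
spans = [(0, 2, "LOC"), (0, 6, "ORG")]

tags = write_longest_last(text, spans)   # the LOC annotation is lost
lost = find_overlaps(spans)              # reports the conflicting pair
print(tags)
print(lost)
```

Running a check like `find_overlaps` on the source annotations, where they are available, shows how many entities a given dataset is likely to lose in the flat BIO version.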