NLPDataSet  by liucongg

NLP dataset collection for research

created 4 years ago
1,061 stars

Top 36.2% on sourcepulse

Project Summary

This repository provides a curated collection of Chinese NLP datasets, primarily focused on Named Entity Recognition (NER), Reading Comprehension (RC), and Text Matching tasks. It aims to offer researchers and practitioners readily usable, cleaned, and standardized datasets for training and evaluating NLP models, with a particular emphasis on simplifying the data preparation process for Chinese language tasks.

How It Works

The project consolidates data from various online sources and competitions, applying simple rule-based cleaning and format standardization. Datasets are converted to common formats, such as BIO tagging for NER, and each comes with a link to its original source or description. The main benefit is that diverse Chinese NLP datasets are aggregated and preprocessed in one place, so users do not have to collect and clean each one themselves.
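As an illustration of the BIO standardization described above, here is a minimal sketch of turning character-level entity spans into BIO tags. It assumes one tag per character and `(start, end_exclusive, label)` spans; the function name `spans_to_bio`, the sample sentence, and the label names are hypothetical and are not taken from the repository.

```python
from typing import List, Tuple

# Assumed span layout: (start index, exclusive end index, entity label).
Span = Tuple[int, int, str]


def spans_to_bio(text: str, spans: List[Span]) -> List[str]:
    """Return one BIO tag per character of `text`."""
    tags = ["O"] * len(text)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first character of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining characters
    return tags


# Hypothetical example sentence with a person and a location.
text = "小明在南京工作"
spans = [(0, 2, "PER"), (3, 5, "LOC")]
tags = spans_to_bio(text, spans)
# tags == ["B-PER", "I-PER", "O", "B-LOC", "I-LOC", "O", "O"]
```

Character-level tagging is a common choice for Chinese NER because it avoids depending on a word segmenter, but the actual granularity used by each dataset should be checked against its description.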

Quick Start & Requirements

  • Datasets are available via Baidu Cloud links with provided extraction codes.
  • No specific software installation is required to download and use the datasets.
  • Access to Baidu Cloud and sufficient storage space are the main requirements.

Highlighted Details

  • Comprehensive collection of 22 Chinese NER datasets, including CMeEE, CLUENER, and MSRA.
  • Aggregation of 16 Chinese text matching datasets, such as LCQMC, AFQMC, and Chinese-MNLI.
  • Curation of 9 Chinese extractive reading comprehension datasets, including DRCD, CMRC2018, and DuReader.
  • Datasets are cleaned with simple rules and standardized to formats like BIO tagging for NER.
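To use BIO-tagged NER data like the sets listed above, the tag sequence usually has to be decoded back into entity spans. The sketch below shows one way to do that on an in-memory list of tags. The on-disk file layout of the datasets is not described here, so file reading is left out, and the name `bio_to_spans` is hypothetical.

```python
from typing import List, Tuple


def bio_to_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Decode BIO tags into (start, end_exclusive, label) spans."""
    spans = []
    start, label = None, None
    # A trailing "O" sentinel closes any entity that runs to the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        # An I- tag with the same label extends the open entity.
        if start is not None and tag == f"I-{label}":
            continue
        # Anything else closes the open entity, if there is one.
        if start is not None:
            spans.append((start, i, label))
            start, label = None, None
        # A B- tag opens a new entity; stray I- tags are ignored.
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


decoded = bio_to_spans(["B-PER", "I-PER", "O", "B-LOC", "I-LOC", "O", "O"])
# decoded == [(0, 2, "PER"), (3, 5, "LOC")]
```

Ignoring an `I-` tag that has no preceding `B-` is a design choice of this sketch; some evaluation scripts instead treat such a tag as the start of a new entity.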

Maintenance & Community

The project was last updated on June 16, 2022. It is maintained by "NJUST-TB". No community links or active development signals are present in the README.

Licensing & Compatibility

The README states that the datasets are for academic research only and must not be used for commercial purposes.

Limitations & Caveats

The README notes that, because the cleaning rules are simple, long entities may overwrite short ones during BIO conversion for NER, so a short entity that overlaps a longer one can be lost. The datasets are distributed via Baidu Cloud, which may be difficult to access from some regions.
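The overwrite caveat can be made concrete with a small sketch. It assumes a flat, one-tag-per-character BIO scheme in which spans are written shortest first, so a longer overlapping span replaces the tags of a shorter one. This is one plausible way the behavior could arise, not the repository's actual conversion code, and the function name, sentence, and labels are hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, label), an assumed layout


def spans_to_bio_longest_wins(text: str, spans: List[Span]) -> List[str]:
    """Flat BIO tagging in which longer spans overwrite shorter ones."""
    tags = ["O"] * len(text)
    # Write short spans first so that longer spans are applied last.
    for start, end, label in sorted(spans, key=lambda s: s[1] - s[0]):
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


# Hypothetical nested entities: a city name inside a bridge name.
text = "南京市长江大桥"
nested = [(0, 3, "GPE"), (0, 7, "FAC")]
tags = spans_to_bio_longest_wins(text, nested)
# tags == ["B-FAC", "I-FAC", "I-FAC", "I-FAC", "I-FAC", "I-FAC", "I-FAC"]
# The shorter GPE entity no longer appears anywhere in the output.
```

A flat BIO sequence can hold only one label per position, so some entities are necessarily dropped whenever the source annotations are nested. Models that need nested entities should be trained from the original annotations linked for each dataset.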

Health Check

  • Last commit: 3 years ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 18 stars in the last 90 days
