NLPDataSet by liucongg

NLP dataset collection for research

Created 4 years ago

1,083 stars

Top 34.8% on SourcePulse

Project Summary

This repository provides a curated collection of Chinese NLP datasets, primarily focused on Named Entity Recognition (NER), Reading Comprehension (RC), and Text Matching tasks. It aims to offer researchers and practitioners readily usable, cleaned, and standardized datasets for training and evaluating NLP models, with a particular emphasis on simplifying the data preparation process for Chinese language tasks.

How It Works

The project consolidates data from various online sources and competitions, performing simple rule-based cleaning and format standardization. Datasets are converted to common formats, such as BIO tagging for NER, and links to original sources or descriptions are provided. The primary benefit is the aggregation and preprocessing of diverse Chinese NLP datasets into a more accessible format, saving users significant time and effort in data collection and cleaning.
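The BIO standardization mentioned above can be illustrated with a minimal sketch. It assumes character-level tokens (common for Chinese NER) and a hypothetical span layout of `(start, end_exclusive, label)`. The function name `spans_to_bio` and the sample sentence are illustrative assumptions and are not taken from the repository.

```python
from typing import List, Tuple

# Hypothetical span layout: (start, end_exclusive, label).
# This is an assumption for illustration, not the repository's format.
Span = Tuple[int, int, str]


def spans_to_bio(text: str, spans: List[Span]) -> List[str]:
    """Convert character-offset entity spans into one BIO tag per character."""
    tags = ["O"] * len(text)
    for start, end, label in spans:
        if not (0 <= start < end <= len(text)):
            raise ValueError(f"span out of range: {(start, end, label)}")
        tags[start] = f"B-{label}"          # first character of the entity
        for i in range(start + 1, end):     # remaining characters
            tags[i] = f"I-{label}"
    return tags


if __name__ == "__main__":
    sentence = "小明在北京工作"
    for ch, tag in zip(sentence, spans_to_bio(sentence, [(0, 2, "PER"), (3, 5, "LOC")])):
        print(ch, tag)
```

Each character receives exactly one tag, so a model can be trained on the tag sequence as a plain sequence-labeling target.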

Quick Start & Requirements

Datasets are available via Baidu Cloud links with provided extraction codes.
No specific software installation is required to download and use the datasets.
Access to Baidu Cloud and sufficient storage space are the main requirements.

Highlighted Details

Comprehensive collection of 22 Chinese NER datasets, including CMeEE, CLUENER, and MSRA.
Aggregation of 16 Chinese text matching datasets, such as LCQMC, AFQMC, and Chinese-MNLI.
Curation of 9 Chinese extractive reading comprehension datasets, including DRCD, CMRC2018, and DuReader.
Datasets are cleaned with simple rules and standardized to formats like BIO tagging for NER.

Maintenance & Community

The project was last updated on June 16, 2022. It is maintained by "NJUST-TB". No community links or active development signals are present in the README.

Licensing & Compatibility

The README states that the datasets are for academic research only and must not be used for commercial purposes.

Limitations & Caveats

The README notes that, because the cleaning rules are simple, a long entity may overwrite a short entity during BIO conversion for NER, so the shorter annotation can be lost. The datasets are distributed only via Baidu Cloud, which may be difficult to access from some regions.
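The overwrite caveat can be reproduced with a small sketch. A flat BIO sequence holds one tag per character, so when two annotated spans overlap, only one can survive. The code below assumes, purely for illustration, that longer spans are written last and therefore win. The repository's actual rules are not shown in this summary, and the names `to_flat_bio` and `overwritten_spans` are hypothetical.

```python
from typing import List, Tuple

# Hypothetical span layout: (start, end_exclusive, label).
Span = Tuple[int, int, str]


def to_flat_bio(text: str, spans: List[Span]) -> List[str]:
    """Write spans into one tag per character; longer spans are written last."""
    tags = ["O"] * len(text)
    # Assumed rule: sort by length so a long entity overwrites a short one.
    for start, end, label in sorted(spans, key=lambda s: s[1] - s[0]):
        tags[start] = f"B-{label}"
        tags[start + 1:end] = [f"I-{label}"] * (end - start - 1)
    return tags


def overwritten_spans(text: str, spans: List[Span]) -> List[Span]:
    """Return the spans that can no longer be read back from the flat tags."""
    tags = to_flat_bio(text, spans)
    lost = []
    for start, end, label in spans:
        expected = [f"B-{label}"] + [f"I-{label}"] * (end - start - 1)
        if tags[start:end] != expected:
            lost.append((start, end, label))
    return lost


if __name__ == "__main__":
    text = "南京大学位于南京"
    spans = [(0, 2, "LOC"), (0, 4, "ORG"), (6, 8, "LOC")]
    print(to_flat_bio(text, spans))
    print(overwritten_spans(text, spans))
```

A check of this kind, run against the original annotations, gives a rough count of how many nested entities a flat BIO version no longer contains.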

