nlp_chinese_corpus by brightmart

Chinese NLP corpus for pre-training and language model tasks

Created 7 years ago

9,856 stars

Top 5.1% on SourcePulse


3 Experts Love This Project

Yaowei Zheng

Author of LLaMA-Factory

Shizhe Diao

Author of LMFlow; Research Scientist at NVIDIA

Junyang Lin

Core Maintainer at Alibaba Qwen

Project Summary

This repository provides a large-scale, diverse collection of Chinese language corpora for Natural Language Processing (NLP) research and development. It aims to address the scarcity of readily available, high-quality Chinese datasets, benefiting researchers, students, and practitioners working on Chinese NLP tasks.

How It Works

The project curates and distributes several distinct Chinese datasets: Wikipedia articles, news articles, question-answering pairs, community discussions, and parallel translation data. Each dataset is processed and deduplicated, and is often provided in JSON format with a clearly defined schema, so it can be fed into NLP pipelines and model training with little extra preparation.
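As a rough illustration of consuming one of these JSON files, the sketch below streams records without loading a multi-gigabyte file into memory. It assumes one JSON object per line (JSON Lines), which the summary above does not confirm; the function name `iter_records` and any file path passed to it are hypothetical.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_records(path: str) -> Iterator[dict]:
    """Yield one dict per non-empty line of a file.

    Assumption: the file holds one JSON object per line (JSON Lines).
    Check the schema notes of the dataset you downloaded before relying on this.
    """
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            try:
                yield json.loads(line)
            except json.JSONDecodeError as error:
                # Report the offending line instead of failing silently
                raise ValueError(f"{path}:{line_number}: invalid JSON") from error
```

Because the function is a generator, it can be chained with filters or batched tokenization while holding only one record in memory at a time.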

Quick Start & Requirements

- Installation: The data is primarily distributed through direct download links (Google Drive, Baidu Netdisk). The data itself needs no installation step.
- Prerequisites: Access to the download links and enough storage for the datasets, which range from hundreds of MB to several GB each.
- Resources: Download time depends on connection speed and dataset size. Processing and training need a standard NLP development environment (Python and a deep learning framework).
- Links:
  - Zenodo DOI
  - CLUE Benchmark (related project)

Highlighted Details

- Includes over 1 million structured Chinese Wikipedia articles.
- Features 2.5 million news articles with metadata (keywords, source, time).
- Offers 4.1 million high-quality community Q&A pairs with upvoting information.
- Provides 5.2 million Chinese-English parallel sentences for translation tasks.
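The upvoting information on the Q&A pairs could be used to keep only well-received answers. The following is a minimal sketch under stated assumptions: the field name `star` is a placeholder for whatever key the dataset actually uses for vote counts, and `filter_by_votes` is a hypothetical helper, not part of the project.

```python
from typing import Iterable, Iterator


def filter_by_votes(
    records: Iterable[dict],
    min_votes: int,
    votes_field: str = "star",  # placeholder field name; check the real schema
) -> Iterator[dict]:
    """Yield records whose vote count is at least min_votes."""
    for record in records:
        try:
            # A missing key counts as zero votes
            votes = int(record.get(votes_field, 0))
        except (TypeError, ValueError):
            continue  # skip records whose vote count is not a number
        if votes >= min_votes:
            yield record
```

For example, `list(filter_by_votes(pairs, min_votes=10))` would keep only the pairs in `pairs` with ten or more votes, where `pairs` is any iterable of record dicts.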

Maintenance & Community

The project is maintained by Bright Xu. Contributions are welcomed via email, with incentives offered for adopted datasets. A related project, CLUE Benchmark, is also linked, suggesting an active ecosystem.

Licensing & Compatibility

The datasets are generally available for research purposes. The README does not state a specific license, but the project's stated goal of promoting Chinese NLP development implies a permissive stance toward academic use. Commercial users would need to clarify the terms before relying on the data.

Limitations & Caveats

The datasets are primarily snapshots from 2018-2019, meaning they may not reflect the most current language usage. Some datasets are split into train/validation/test sets, but test sets are not always provided for download, requiring users to submit results for evaluation on a separate platform.
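When a test set is not downloadable, one option for local experiments is to carve a held-out slice from the data that is available. The sketch below is not the project's own split or evaluation procedure; it is a generic hash-based assignment, and `assign_split` and the idea of a per-record ID string are assumptions made for illustration.

```python
import hashlib


def assign_split(record_id: str, test_percent: int = 10) -> str:
    """Deterministically map a record ID to "train" or "test".

    Hashing the ID (instead of random sampling) keeps the assignment
    stable across runs and machines, so a record never changes sides.
    """
    digest = hashlib.sha256(record_id.encode("utf-8")).digest()
    # Use the first 8 bytes of the hash as an integer bucket in [0, 100)
    bucket = int.from_bytes(digest[:8], "big") % 100
    return "test" if bucket < test_percent else "train"
```

Scores on such a local slice are not comparable to results from the official evaluation platform; they are only useful for tracking progress during development.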

Health Check

Last Commit

2 weeks ago

Responsiveness

1 week

Star History

12 stars in the last 30 days