nlp_chinese_corpus  by brightmart

Chinese NLP corpus for pre-training and language model tasks

Created 6 years ago
9,778 stars

Top 5.2% on SourcePulse

Project Summary

This repository provides a large-scale, diverse collection of Chinese language corpora for Natural Language Processing (NLP) research and development. It aims to address the scarcity of readily available, high-quality Chinese datasets, benefiting researchers, students, and practitioners working on Chinese NLP tasks.

How It Works

The project curates and distributes several distinct Chinese datasets, including Wikipedia articles, news articles, question-answering pairs, community discussions, and parallel translation data. Each dataset is processed, deduplicated, and often provided in JSON format with clear schema definitions, facilitating easy integration into NLP pipelines and model training.
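As a rough illustration of how such JSON data might be read, here is a minimal sketch. It assumes each file stores one JSON object per line (JSON Lines); that layout is an assumption rather than something this summary confirms, and the helper name `iter_records` is hypothetical.

```python
import json
from typing import Iterable, Iterator


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty line, assuming JSON Lines input.

    Lines that are blank, malformed, or not JSON objects are skipped
    so that a single bad row does not abort a long read.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(record, dict):
            yield record
```

Because an open file object is itself an iterable of lines, the same function can be pointed at a downloaded file with `open(path, encoding="utf-8")`, which streams the data instead of loading a multi-gigabyte file into memory.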

Quick Start & Requirements

  • Installation: Data is primarily accessed via direct download links (Google Drive, Baidu Netdisk). No specific installation command is required for the data itself.
  • Prerequisites: Access to download links and sufficient storage space for the datasets (ranging from hundreds of MB to several GB per dataset).
  • Resources: Download times depend on internet speed and dataset size. Processing and training will require standard NLP development environments (Python, deep learning frameworks).

Highlighted Details

  • Includes over 1 million structured Chinese Wikipedia articles.
  • Features 2.5 million news articles with metadata (keywords, source, time).
  • Offers 4.1 million high-quality community Q&A pairs with upvoting information.
  • Provides 5.2 million Chinese-English parallel sentences for translation tasks.
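Since the Q&A data carries upvoting information, one plausible use is to keep only well-received answers. The sketch below assumes records are already parsed into dicts and that the vote count lives in a field called `star`; that field name is an assumption to verify against the dataset's own schema, and `filter_by_upvotes` is a hypothetical helper.

```python
from typing import Iterable, Iterator


def filter_by_upvotes(
    records: Iterable[dict], min_upvotes: int, field: str = "star"
) -> Iterator[dict]:
    """Yield records whose vote count is at least min_upvotes.

    The default field name "star" is an assumption; pass the real
    field name once the schema has been checked.
    """
    for record in records:
        try:
            # Missing field counts as zero votes.
            votes = int(record.get(field, 0))
        except (TypeError, ValueError):
            # Skip rows where the vote count is null or non-numeric.
            continue
        if votes >= min_upvotes:
            yield record
```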

Maintenance & Community

The project is maintained by Bright Xu. Contributions are welcomed via email, with incentives offered for adopted datasets. A related project, CLUE Benchmark, is also linked, suggesting an active ecosystem.

Licensing & Compatibility

The datasets are generally available for research purposes. Specific licensing details are not explicitly stated in the README, but the project's goal is to promote the development of Chinese NLP, implying a permissive stance for academic use. Commercial use would require clarification.

Limitations & Caveats

The datasets are primarily snapshots from 2018-2019, meaning they may not reflect the most current language usage. Some datasets are split into train/validation/test sets, but test sets are not always provided for download, requiring users to submit results for evaluation on a separate platform.
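Where no test set can be downloaded, a local held-out split can stand in during development. The following is a minimal sketch of one common approach, not the project's own method: hash each record's identifier so the assignment stays stable across runs. The function name `assign_split` and the 5% default are illustrative assumptions.

```python
import hashlib


def assign_split(record_id: str, dev_percent: int = 5) -> str:
    """Deterministically map a record ID to "dev" or "train".

    MD5 is used only as a stable, well-distributed hash here,
    not for any security purpose.
    """
    digest = hashlib.md5(record_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # bucket in the range 0..99
    return "dev" if bucket < dev_percent else "train"
```

Hashing the identifier rather than shuffling means a record keeps its split even if the file is re-downloaded or reordered.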

Health Check

  • Last commit: 1 week ago
  • Responsiveness: 1 week
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 18 stars in the last 30 days