Chinese NLP corpus for pre-training and language model tasks
This repository provides a large-scale, diverse collection of Chinese language corpora for Natural Language Processing (NLP) research and development. It aims to address the scarcity of readily available, high-quality Chinese datasets, benefiting researchers, students, and practitioners working on Chinese NLP tasks.
How It Works
The project curates and distributes several distinct Chinese datasets, including Wikipedia articles, news articles, question-answering pairs, community discussions, and parallel translation data. Each dataset is processed, deduplicated, and often provided in JSON format with clear schema definitions, facilitating easy integration into NLP pipelines and model training.
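As a rough illustration of how such a dataset might be consumed, the sketch below reads records and drops repeated ones. It assumes a JSON Lines layout (one JSON object per line) and uses hypothetical field names (`id`, `title`, `text`); the actual layout and schema of each dataset should be checked against the README before use.

```python
import json
from typing import Iterable, Iterator


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty line (assumes one JSON object per line)."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        yield json.loads(line)


def dedupe(records: Iterable[dict], key: str = "id") -> Iterator[dict]:
    """Drop records whose `key` value was already seen.

    `key` is a hypothetical field name; records without it pass through.
    """
    seen = set()
    for rec in records:
        value = rec.get(key)
        if value is not None and value in seen:
            continue
        seen.add(value)
        yield rec


# Illustrative in-memory sample; a real run would iterate over an open file
# (opened with encoding="utf-8") instead of this list.
sample = [
    '{"id": "1", "title": "示例", "text": "这是一个例子。"}',
    "",
    '{"id": "1", "title": "示例", "text": "这是一个例子。"}',
    '{"id": "2", "title": "另一个", "text": "第二条记录。"}',
]

records = list(dedupe(iter_records(sample)))
```

Streaming line by line keeps memory use flat, which matters for corpora of this size; the `seen` set still grows with the number of distinct keys.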
Quick Start & Requirements
Highlighted Details
Maintenance & Community
The project is maintained by Bright Xu. Contributions are accepted by email, and incentives are offered for datasets that are adopted. The README also links to a related project, CLUE Benchmark.
Licensing & Compatibility
The README does not state an explicit license. The datasets are generally available for research purposes, and the project's stated goal of promoting Chinese NLP development suggests that academic use is acceptable. Commercial use would require clarification from the maintainer.
Limitations & Caveats
The datasets are mostly snapshots from 2018-2019, so they may not reflect current language usage. Some datasets are split into train/validation/test sets, but the test sets are not always available for download; in those cases users must submit their results to a separate platform for evaluation.