Vast Chinese corpus for LLM training
MNBVC is a large, continuously growing Chinese-language corpus project that aims to match the scale of the data used to train large language models such as ChatGPT. It is aimed at researchers and developers working on Chinese NLP, and it covers both mainstream and niche cultural content, including "Mars language" (火星文). Its main benefit is a very large and varied Chinese text resource for model training and research.
How It Works
The project collects raw text data from the Chinese internet. Its target is 253TB, and current progress stands at 57090GB, a little over a fifth of that target. Data comes in several formats (txt, json, jsonl, parquet) and undergoes only minimal processing, such as HTML/XML-to-text conversion. To avoid copyright disputes and keep a low profile, the dataset intentionally omits any index or classification of specific content. Data is distributed via P2P and Baidu Netdisk, and the passwords for the compressed archives are provided.
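As an illustration of the kind of minimal HTML-to-text step described above, here is a small sketch that uses only Python's standard library. It is not the project's actual tooling, which this summary does not describe. The names `TextExtractor`, `html_to_text`, and `to_jsonl_line` and the `"text"` field are hypothetical and chosen for illustration only.

```python
import json
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and skips the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if self._skip_depth == 0:
            stripped = data.strip()
            if stripped:
                self._chunks.append(stripped)

    def text(self):
        return "\n".join(self._chunks)


def html_to_text(html):
    """Return the visible text of an HTML string, one text run per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def to_jsonl_line(text):
    """Serialize one document as a single JSONL line.

    The "text" key is a hypothetical field name, not the project's schema.
    ensure_ascii=False keeps Chinese characters readable in the output.
    """
    return json.dumps({"text": text}, ensure_ascii=False)


if __name__ == "__main__":
    sample = "<html><body><p>你好</p><script>x=1</script><p>world</p></body></html>"
    print(to_jsonl_line(html_to_text(sample)))
```

A rough pass like this drops markup but does no deduplication, language filtering, or quality scoring, so further cleaning would still be needed before training.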
Quick Start & Requirements
Highlighted Details
Maintenance & Community
The project is driven by the "MOP里屋社区" community and the "MNBVC Team." They are actively recruiting volunteers for data cleaning, OCR, QA alignment, and testing, and can be reached by email at MNBVC@253874.net.
Licensing & Compatibility
The README does not explicitly state a license. Given the nature of scraped internet data and the project's emphasis on low-profile operations to avoid copyright issues, commercial use or linking with closed-source projects may require careful legal review.
Limitations & Caveats
The project explicitly states it lacks the capacity for copyright review of its data sources. Users are urged to refrain from discussing the index or specific content within the dataset to avoid copyright disputes. The data undergoes only "rough processing."