MNBVC by esbatmop

Vast Chinese corpus for LLM training

Created 3 years ago

4,235 stars

Top 11.4% on SourcePulse

Project Summary

MNBVC is a large, continuously growing Chinese-language corpus project that aims to rival the scale of data used to train large language models such as ChatGPT. It targets researchers and developers working on Chinese NLP and provides a diverse dataset covering both mainstream and niche cultural content, including "Mars language" (火星文). Its primary benefit is a very large and varied Chinese text resource for model training and research.

How It Works

The project collects raw text from the Chinese internet. The target size is 253 TB, and current progress stands at 57,090 GB (roughly 57 TB). Data is stored in several formats (txt, json, jsonl, parquet) and undergoes only minimal processing, such as HTML/XML-to-text conversion. To avoid copyright disputes and keep a low profile, the dataset intentionally omits any index or classification of specific content. Data is distributed via P2P and Baidu Netdisk, and the passwords for the compressed archives are provided.
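Since jsonl is one of the listed formats, a reader will typically want to stream records rather than load a whole file into memory. The following is a minimal sketch using only the Python standard library. The function name `iter_jsonl` and the `text` field in the sample are assumptions for illustration; the actual record schema in MNBVC packages is not described here and may differ.

```python
import io
import json
from typing import Iterator, TextIO


def iter_jsonl(stream: TextIO) -> Iterator[dict]:
    """Yield one dict per valid JSON-object line, skipping blanks and bad lines."""
    for line in stream:
        line = line.strip()
        if not line:
            continue  # tolerate empty lines between records
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # roughly processed data may contain malformed lines
        if isinstance(record, dict):
            yield record


# Hypothetical sample; the "text" key is an assumed field name.
sample = io.StringIO('{"text": "你好"}\n\nnot json\n{"text": "世界"}\n')
records = list(iter_jsonl(sample))
print(len(records))
```

Skipping malformed lines silently is a design choice suited to exploratory work; a cleaning pipeline would more likely count or log them.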

Quick Start & Requirements

Download: Data is available via P2P (using provided keys and links) or Baidu Netdisk.
Prerequisites: Python for associated cleaning and crawling tools.
Resources: The dataset currently occupies roughly 57 TB, so substantial disk space is required for a full download.
Links:
- Hugging Face (cleaned data): https://huggingface.co/datasets/liwu/MNBVC
- GitHub Repo: https://github.com/esbatmop/MNBVC
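Given the size involved, it can help to check free disk space before starting a download. The sketch below uses `shutil.disk_usage` from the Python standard library. The helper name `has_room`, the 10% headroom, and the decimal conversion (1 GB = 10^9 bytes) are assumptions for illustration, not values taken from the project.

```python
import shutil


def has_room(path: str, required_bytes: int, headroom: float = 0.1) -> bool:
    """Return True if the filesystem holding `path` has enough free space.

    `headroom` adds a safety margin (10% by default) on top of the raw size,
    e.g. for temporary files created while extracting archives.
    """
    free = shutil.disk_usage(path).free
    return free >= required_bytes * (1 + headroom)


# Assumed conversion: 57,090 GB treated as decimal gigabytes.
REQUIRED_BYTES = 57_090 * 10**9
print(has_room(".", REQUIRED_BYTES))
```

In practice most users will download only selected packages, so `required_bytes` would be the size of that subset rather than the full corpus.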

Highlighted Details

Current size: 57,090 GB (roughly 57 TB), with a target of 253 TB.
Includes diverse content: mainstream, niche, and "Mars language" (火星文).
Provides numerous specialized tools for data cleaning, code crawling, and multimodal processing.
Actively seeking contributors for various tasks, including OCR, QA alignment, and testing.
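The size figures in this list are given in different units (GB for the current size, TB for the target), so a small calculation makes the progress easier to read. This is a sketch that assumes decimal units (1 TB = 1000 GB); the source does not say whether decimal or binary units are meant, and the function name `progress_fraction` is hypothetical.

```python
def progress_fraction(current_gb: float, target_tb: float, gb_per_tb: int = 1000) -> float:
    """Return current size as a fraction of the target, after unit conversion."""
    if target_tb <= 0:
        raise ValueError("target_tb must be positive")
    return current_gb / (target_tb * gb_per_tb)


# Figures from the summary: 57,090 GB collected, 253 TB targeted.
fraction = progress_fraction(57_090, 253)
print(f"{fraction:.1%} of target")
```

Passing `gb_per_tb=1024` gives the binary-unit reading instead, which yields a slightly lower fraction.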

Maintenance & Community

The project is driven by the "MOP里屋社区" and the "MNBVC Team." They are actively recruiting volunteers for data cleaning, OCR, QA alignment, and testing roles, with contact via email (MNBVC@253874.net).

Licensing & Compatibility

The README does not explicitly state a license. Given the nature of scraped internet data and the project's emphasis on low-profile operations to avoid copyright issues, commercial use or linking with closed-source projects may require careful legal review.

Limitations & Caveats

The project explicitly states it lacks the capacity for copyright review of its data sources. Users are urged to refrain from discussing the index or specific content within the dataset to avoid copyright disputes. The data undergoes only "rough processing."

Health Check

Last Commit

1 month ago

Responsiveness

1 day

Star History

32 stars in the last 30 days