MNBVC  by esbatmop

Vast Chinese corpus for LLM training

created 2 years ago
3,922 stars

Top 12.7% on sourcepulse

Project Summary

MNBVC is a massive, continuously growing Chinese-language corpus project that aims to rival the scale of data used to train large language models such as ChatGPT. It targets researchers and developers working on Chinese NLP, and provides a diverse dataset spanning mainstream and niche cultural content, including "Mars language" (火星文). Its primary benefit is a very large and varied Chinese text resource for model training and research.

How It Works

The project collects raw text from the Chinese internet, with a target of 253 TB and current progress of 57,090 GB (roughly 57 TB). Files come in several formats (txt, json, jsonl, parquet) and receive only minimal processing, such as HTML/XML-to-text conversion. To avoid copyright disputes and keep a low profile, the dataset intentionally omits any index or classification of specific content. Data is distributed via P2P and Baidu Netdisk, with passwords provided for the compressed archives.
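As an illustration of the kind of minimal HTML-to-text conversion mentioned above, here is a small sketch using only Python's standard library. It is not the project's own tooling; the class name `TextExtractor` and the function `html_to_text` are hypothetical, and the choice to skip `<script>` and `<style>` contents is an assumption.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse runs of whitespace into single spaces.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(markup: str) -> str:
    """Hypothetical helper: strip tags and return plain text."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    sample = "<html><body><p>你好 &amp; <b>世界</b></p><script>x=1</script></body></html>"
    print(html_to_text(sample))
```

A real pipeline would also need to handle character encodings and malformed markup, which this sketch leaves out.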

Highlighted Details

  • Current size: 57,090 GB (roughly 57 TB), with a target of 253 TB.
  • Includes diverse content: mainstream, niche, and "Mars language" (火星文).
  • Provides numerous specialized tools for data cleaning, code crawling, and multimodal processing.
  • Actively seeking contributors for various tasks, including OCR, QA alignment, and testing.
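To put the size figures in the first bullet into proportion, the progress toward the stated target can be worked out as below. This is a rough sketch that assumes decimal units (1 TB = 1000 GB); the function name `progress_percent` is hypothetical.

```python
def progress_percent(current_gb: float, target_tb: float) -> float:
    """Return progress toward the target as a percentage.

    Assumes decimal units, i.e. 1 TB = 1000 GB.
    """
    return current_gb / (target_tb * 1000) * 100


if __name__ == "__main__":
    # Figures taken from the summary: 57,090 GB collected, 253 TB target.
    print(f"{progress_percent(57090, 253):.1f}%")  # prints "22.6%"
```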

Maintenance & Community

The project is driven by the "MOP里屋社区" and the "MNBVC Team." They are actively recruiting volunteers for data cleaning, OCR, QA alignment, and testing roles, with contact via email (MNBVC@253874.net).

Licensing & Compatibility

The README does not explicitly state a license. Given the nature of scraped internet data and the project's emphasis on low-profile operations to avoid copyright issues, commercial use or linking with closed-source projects may require careful legal review.

Limitations & Caveats

The project explicitly states it lacks the capacity for copyright review of its data sources. Users are urged to refrain from discussing the index or specific content within the dataset to avoid copyright disputes. The data undergoes only "rough processing."

Health Check

  • Last commit: 1 week ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0

Star History

100 stars in the last 90 days
