WanJuan1.0 by opendatalab

Multimodal corpus for large model training (LLM/MLLM)

created 2 years ago
564 stars

Project Summary

WanJuan 1.0 is a multimodal corpus of more than 2TB for training large language and multimodal models, with a focus on Chinese-language data. It targets researchers and engineers building generative models who need high-quality, diverse, and value-aligned training data.

How It Works

WanJuan 1.0 integrates text, image-text, and video data drawn from sources such as web pages, books, and media. The data goes through fine-grained cleaning, deduplication, and value alignment, using a combination of rule-based and model-based filtering. This processing is intended to improve knowledge content, logical reasoning, and generalization in downstream models while keeping the data aligned with mainstream Chinese values.
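The rule-based filtering and deduplication steps described above can be illustrated with a minimal sketch. This is not the WanJuan pipeline: the `content` field name, the length and character-ratio thresholds, and the helper names `normalize`, `passes_rules`, and `clean_and_dedupe` are all assumptions made for illustration, and exact hashing stands in for whatever deduplication method the authors actually used.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so trivial variants hash identically."""
    return re.sub(r"\s+", " ", text).strip().lower()


def passes_rules(text: str, min_chars: int = 20) -> bool:
    """Hypothetical rule-based filter: reject short or mostly non-alphanumeric text."""
    if len(text) < min_chars:
        return False
    informative = sum(ch.isalnum() for ch in text)
    return informative / len(text) >= 0.5


def clean_and_dedupe(records, text_key="content"):
    """Keep the first record for each distinct normalized text that passes the rules.

    `text_key` is an assumed field name, not a documented WanJuan schema field.
    """
    seen = set()
    kept = []
    for record in records:
        text = normalize(record.get(text_key, ""))
        if not passes_rules(text):
            continue
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate after normalization
        seen.add(digest)
        kept.append(record)
    return kept
```

Exact hashing only catches duplicates that match after normalization; near-duplicate detection and model-based quality scoring would need additional machinery that this sketch leaves out.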

Quick Start & Requirements

  • Access: The full dataset is available through download links. Separate scripts are provided for downloading image data from URLs.
  • Format: Data is provided in a unified JSONL format (one JSON object per line).
  • Resources: The total dataset size exceeds 2TB.
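Because the dataset exceeds 2TB, files are best read one line at a time rather than loaded whole. The following is a minimal sketch of a streaming JSONL reader; the helper name `iter_jsonl` and the example file path are hypothetical, and no particular record schema is assumed.

```python
import json
from typing import Iterator


def iter_jsonl(path: str) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line, without loading the whole file."""
    with open(path, "r", encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                yield json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{line_number}: invalid JSON") from exc


if __name__ == "__main__":
    # "part-000.jsonl" is a placeholder path, not an actual WanJuan file name.
    for index, record in enumerate(iter_jsonl("part-000.jsonl")):
        print(sorted(record.keys()))
        if index >= 2:
            break
```

Printing the keys of the first few records is a quick way to discover the actual fields in a given subset before writing any processing code.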

Highlighted Details

  • Multimodal Integration: Combines text (1TB+), image-text (140GB+), and video (900GB+) data.
  • Extensive Coverage: Spans fields like technology, literature, media, education, and law.
  • Value Alignment: Focuses on alignment with mainstream Chinese values through algorithmic and manual evaluation.
  • Proven Application: Used to train models such as Intern Multimodal and Intern Puyu (InternLM), which are reported to perform well on generative tasks.

Maintenance & Community

  • Release: First released on August 14, 2023, with a security upgrade and further cleaning on October 20, 2023.
  • Contact: OpenDataLab@pjlab.org.cn for infringement claims.
  • Citation: Available via arXiv (2308.10755).

Licensing & Compatibility

  • License: CC BY 4.0.
  • Restrictions: Users must give attribution, indicate any modifications, and not impose additional restrictions on downstream recipients. Commercial use and use within closed-source projects are permitted under the terms of CC BY 4.0.

Limitations & Caveats

Some subsets may be governed by different agreements, so users should verify the license of each subset they use. The dataset focuses primarily on Chinese-language data, which may limit its direct usefulness for models that need large amounts of English-only training data.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 week
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 7 stars in the last 90 days
