WanJuan1.0 by opendatalab

Multimodal corpus for large model training (LLM/MLLM)

Created 2 years ago
566 stars

Top 56.8% on SourcePulse

Project Summary

The WanJuan 1.0 dataset provides a comprehensive, 2TB multimodal corpus for training large language and multimodal models, with a focus on Chinese language data. It addresses the need for high-quality, diverse, and value-aligned data for advanced AI development, targeting researchers and engineers building sophisticated generative models.

How It Works

WanJuan 1.0 integrates text, image-text, and video data, sourced from diverse origins like web pages, books, and media. The dataset undergoes rigorous fine-grained cleaning, deduplication, and value alignment processes, combining rule-based and model-based filtering. This meticulous processing aims to enhance knowledge content, logical reasoning, and generalization capabilities in downstream models, while ensuring alignment with mainstream Chinese values.

Quick Start & Requirements

  • Access: Download links are provided for the full dataset. Specific scripts are available for downloading image data from URLs.
  • Format: Data is provided in unified jsonl format.
  • Resources: The total dataset size exceeds 2TB.
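Since the data ships as one JSON object per line (jsonl), a minimal reader can stream records without loading a multi-gigabyte file into memory. The sketch below is illustrative only: the filename and the `content` field are assumptions, as the actual schema varies by subset and should be checked against the dataset documentation.

```python
import json

def iter_records(path):
    """Stream records from a jsonl file, one JSON object per line."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)

# Hypothetical usage (file name and field name are assumptions):
# for rec in iter_records("wanjuan-text-part-000001.jsonl"):
#     print(rec.get("content", "")[:80])
```

Streaming line by line keeps memory usage flat regardless of file size, which matters for a corpus exceeding 2TB.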

Highlighted Details

  • Multimodal Integration: Combines text (1TB+), image-text (140GB+), and video (900GB+) data.
  • Extensive Coverage: Spans fields like technology, literature, media, education, and law.
  • Value Alignment: Focuses on alignment with mainstream Chinese values through algorithmic and manual evaluation.
  • Proven Application: Used in training models like Intern Multimodal and Intern Puyu, demonstrating strong performance in generative tasks.

Maintenance & Community

  • Release: First released on August 14, 2023, with a security upgrade and further cleaning on October 20, 2023.
  • Contact: OpenDataLab@pjlab.org.cn for infringement claims.
  • Citation: Available via arXiv (2308.10755).

Licensing & Compatibility

  • License: CC BY 4.0.
  • Restrictions: Requires attribution, indication of modifications, and prohibits additional restrictions. Commercial use and linking with closed-source projects are permitted under the terms of CC BY 4.0.

Limitations & Caveats

Some subsets may be subject to different agreements; users must verify specific subset licenses. The dataset is primarily focused on Chinese language data, which may limit its direct applicability for models requiring extensive English-only training data.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 3 stars in the last 30 days

Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Jiaming Song (Chief Scientist at Luma AI), and 1 more.
