WanJuan1.0 by opendatalab

Multimodal corpus for large model training (LLM/MLLM)

Created 2 years ago

570 stars

Top 56.5% on SourcePulse

Project Summary

WanJuan 1.0 is a multimodal corpus of more than 2TB for training large language models and multimodal models, with a focus on Chinese-language data. It is meant to supply high-quality, diverse, and value-aligned training data, and targets researchers and engineers building generative models.

How It Works

WanJuan 1.0 integrates text, image-text, and video data drawn from sources such as web pages, books, and media. The data goes through fine-grained cleaning, deduplication, and value alignment, using a combination of rule-based and model-based filtering. The stated goal of this processing is to improve knowledge content, logical reasoning, and generalization in downstream models while keeping the data aligned with mainstream Chinese values.
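As a rough illustration of what a rule-based filter combined with deduplication can look like, here is a minimal Python sketch. It is not the pipeline used to build WanJuan 1.0: the `content` field name, the thresholds, and the helper names are all assumptions made for the example, and it only catches exact duplicates after normalization, not near-duplicates.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so trivial variants hash the same."""
    return re.sub(r"\s+", " ", text).strip().lower()


def passes_rules(text: str, min_chars: int = 20, max_symbol_ratio: float = 0.3) -> bool:
    """Hypothetical rule-based filter: drop very short or symbol-heavy text."""
    if len(text) < min_chars:
        return False
    symbols = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    return symbols / len(text) <= max_symbol_ratio


def clean_and_dedup(records, text_key="content"):
    """Yield records that pass the rules, skipping exact duplicates.

    `text_key` is an assumed field name, not taken from the dataset schema.
    """
    seen = set()
    for record in records:
        text = record.get(text_key, "")
        if not passes_rules(text):
            continue
        digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        yield record
```

A model-based stage (for example a quality or toxicity classifier) would typically run after a cheap pass like this one, so that the expensive model only sees records that survive the rules.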

Quick Start & Requirements

Access: Download links are provided for the full dataset. Specific scripts are available for downloading image data from URLs.
Format: Data is provided in a unified JSONL format (one JSON object per line).
Resources: The total dataset size exceeds 2TB.
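Because the corpus is distributed as JSONL and is far too large to load into memory, reading it line by line is the natural approach. The sketch below assumes only that each non-empty line is a standalone JSON document; the function name `iter_jsonl` and the file path in the usage note are hypothetical.

```python
import json
from typing import Iterator


def iter_jsonl(path: str) -> Iterator[dict]:
    """Stream records from a JSONL file without loading the whole file."""
    with open(path, "r", encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            try:
                yield json.loads(line)
            except json.JSONDecodeError as err:
                # Report the position so a corrupt shard is easy to locate.
                raise ValueError(f"{path}:{line_number}: invalid JSON") from err
```

Usage would look like `for record in iter_jsonl("part-000.jsonl"): ...`, where the file name is a placeholder for whichever shard was downloaded.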

Highlighted Details

Multimodal Integration: Combines text (1TB+), image-text (140GB+), and video (900GB+) data.
Extensive Coverage: Spans fields like technology, literature, media, education, and law.
Value Alignment: Focuses on alignment with mainstream Chinese values through algorithmic and manual evaluation.
Proven Application: Used in training models like Intern Multimodal and Intern Puyu, demonstrating strong performance in generative tasks.
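As a quick sanity check on the sizes quoted above, the per-modality lower bounds can be summed and compared against the 2TB total. The snippet assumes decimal units (1 TB = 1000 GB) and treats each "+" figure as a lower bound, so the results are minimums rather than exact sizes.

```python
# Lower bounds from the listing, in decimal gigabytes (assumes 1 TB = 1000 GB).
SUBSET_MIN_GB = {"text": 1000, "image-text": 140, "video": 900}


def total_min_gb(sizes: dict) -> int:
    """Sum the per-modality lower bounds."""
    return sum(sizes.values())


def shares(sizes: dict) -> dict:
    """Percentage of the minimum total taken by each modality, to one decimal."""
    total = total_min_gb(sizes)
    return {name: round(100 * gb / total, 1) for name, gb in sizes.items()}


print(total_min_gb(SUBSET_MIN_GB))  # 2040, i.e. just over 2 TB
print(shares(SUBSET_MIN_GB))
```

On these lower bounds, text and video together account for well over 90% of the corpus by size, with image-text making up the small remainder.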

Maintenance & Community

Release: First released on August 14, 2023, with a security upgrade and further cleaning on October 20, 2023.
Contact: OpenDataLab@pjlab.org.cn for infringement claims.
Citation: Available via arXiv (2308.10755).

Licensing & Compatibility

License: CC BY 4.0.
Restrictions: Requires attribution and an indication of any modifications, and prohibits imposing additional restrictions on recipients. Commercial use and use alongside closed-source projects are permitted under the terms of CC BY 4.0.

Limitations & Caveats

Some subsets may be subject to different agreements; users must verify specific subset licenses. The dataset is primarily focused on Chinese language data, which may limit its direct applicability for models requiring extensive English-only training data.

Health Check

Last Commit: 2 years ago
Responsiveness: Inactive
Star History: 1 star in the last 30 days