Multimodal corpus for large model training (LLM/MLLM)
The WanJuan 1.0 dataset provides a comprehensive, 2TB multimodal corpus for training large language and multimodal models, with a focus on Chinese language data. It addresses the need for high-quality, diverse, and value-aligned data for advanced AI development, targeting researchers and engineers building sophisticated generative models.
How It Works
WanJuan 1.0 integrates text, image-text, and video data, sourced from diverse origins like web pages, books, and media. The dataset undergoes rigorous fine-grained cleaning, deduplication, and value alignment processes, combining rule-based and model-based filtering. This meticulous processing aims to enhance knowledge content, logical reasoning, and generalization capabilities in downstream models, while ensuring alignment with mainstream Chinese values.
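The actual WanJuan cleaning pipeline is not published here; as a rough illustration of the rule-based filtering and deduplication steps described above, a minimal sketch might look like this (function name, record schema, and thresholds are all hypothetical, and the real pipeline additionally applies model-based filtering and value alignment):

```python
import hashlib

def dedup_and_filter(records, min_len=50):
    """Drop exact duplicates (by content hash) and records failing a
    simple rule-based length check. Illustrative only; not the actual
    WanJuan pipeline."""
    seen = set()
    kept = []
    for rec in records:
        text = rec.get("content", "")
        if len(text) < min_len:
            # rule-based filter: discard very short records
            continue
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest in seen:
            # exact-duplicate removal via content hashing
            continue
        seen.add(digest)
        kept.append(rec)
    return kept
```

Production-scale corpora typically replace the exact-hash step with near-duplicate detection (e.g. MinHash), but the shape of the loop is the same.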
Quick Start & Requirements
The dataset is distributed in jsonl format (one JSON object per line).

Highlighted Details
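A jsonl corpus can be consumed record by record without loading the whole file into memory. A minimal reader sketch (file path and record schema are hypothetical, not taken from the dataset documentation):

```python
import json

def iter_jsonl(path):
    """Yield one parsed record per non-empty line of a jsonl file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)
```

Streaming line by line keeps memory use constant, which matters for multi-gigabyte shards.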
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
Some subsets may be subject to different agreements; users must verify specific subset licenses. The dataset is primarily focused on Chinese language data, which may limit its direct applicability for models requiring extensive English-only training data.