Chinese video-language dataset and benchmarks for pre-training
Youku-mPLUG provides a large-scale Chinese video-language dataset and benchmarks for pre-training and evaluating multimodal models. It targets researchers and developers working on video understanding and generation tasks, offering a substantial resource for advancing Chinese multimodal AI capabilities.
How It Works
The project introduces Youku-mPLUG, a 10-million-pair Chinese video-text dataset curated from the Youku platform with an emphasis on safety, diversity, and quality, spanning 20 super categories and 45 specific categories. It also provides three downstream benchmarks: Video Category Prediction, Video-Text Retrieval, and Video Captioning. The core approach pre-trains video-language models built on large language model decoders (GPT-3 1.3B/2.7B, BloomZ-7B) on this dataset and fine-tunes them for these downstream tasks.
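As a rough illustration of how such video-text annotations can be consumed, here is a minimal Python sketch that groups pairs by category for the Video Category Prediction benchmark and exposes (video, caption) pairs for retrieval and captioning. The file name and column names are assumptions for illustration only; consult the released files for the actual schema.

```python
import csv
from collections import Counter

def load_annotations(path="youku_mplug_train.csv"):
    # NOTE: the file name and column names below are assumptions,
    # not the dataset's documented schema.
    pairs = []
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            pairs.append({
                "video_id": row["video_id"],   # assumed column name
                "caption": row["title"],       # assumed column name
                "category": row["category"],   # assumed column name
            })
    return pairs

if __name__ == "__main__":
    pairs = load_annotations()
    # Category distribution gives a quick view of the super categories.
    print(Counter(p["category"] for p in pairs).most_common(20))
    # Video-Text Retrieval and Video Captioning consume (video_id, caption) pairs directly.
    print(pairs[0]["video_id"], pairs[0]["caption"])
```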
Quick Start & Requirements
Setup:
conda env create -f environment.yml
conda activate youku
pip install megatron_util==1.3.0 -f https://modelscope.oss-cn-beijing.aliyuncs.com/releases/repo.html

For caption evaluation, a Java Runtime Environment is needed:
apt-get install default-jre

Key dependencies are megatron_util and the Java Runtime Environment (caption evaluation only). Pre-trained checkpoints for the GPT-3 1.3B/2.7B and BloomZ-7B models are required and are available via ModelScope and Hugging Face.
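Since the checkpoints are distributed through ModelScope and Hugging Face, one way to fetch them is the ModelScope SDK, sketched below. The model id is a placeholder, not the project's actual repository name; substitute the checkpoint listed on the project page (or use huggingface_hub.snapshot_download instead).

```python
from modelscope.hub.snapshot_download import snapshot_download

# Placeholder model id -- replace with the actual Youku-mPLUG checkpoint
# name published on ModelScope.
MODEL_ID = "your-org/your-youku-mplug-checkpoint"

# Downloads the checkpoint files and returns the local cache directory.
local_dir = snapshot_download(MODEL_ID)
print("checkpoint downloaded to:", local_dir)
```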
Maintenance & Community
The project is associated with authors from various institutions, indicating academic backing. Links to community resources like Discord or Slack are not explicitly provided in the README.
Licensing & Compatibility
The README does not specify a license for the dataset or the code. Compatibility for commercial use or closed-source linking is not detailed.
Limitations & Caveats
A specific bug in megatron_util requires manually replacing its initialize.py file after installation (see the sketch below). The dataset and models focus primarily on Chinese content, and the absence of licensing details may hinder widespread adoption.
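A minimal sketch of applying the manual patch, assuming the repository ships a fixed initialize.py; the source path below is an assumption, so use the file location named in the project's README.

```python
import os
import shutil
import megatron_util

# Locate the installed megatron_util package directory.
pkg_dir = os.path.dirname(megatron_util.__file__)
target = os.path.join(pkg_dir, "initialize.py")

# Path to the patched file is an assumption -- point this at the file
# the README tells you to use.
patched = "patches/initialize.py"

shutil.copyfile(patched, target)
print("replaced", target)
```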