westlake-repl
Large-scale multimodal micro-video dataset and recommendation code
MicroLens is a large-scale, content-driven dataset for micro-video recommendation research. It provides raw multimodal data (text, audio, image, video) together with user interactions. It targets researchers and engineers in recommender systems and multimodal AI who want to develop and evaluate context-aware recommendation models.
How It Works
The dataset comprises multiple versions (MicroLens-50k, -100k, -1M) with rich multimodal features and user-video interaction logs. This structure facilitates training recommendation models that leverage diverse content modalities, moving beyond traditional ID-based approaches. It supports research in content-driven recommendation, multimodal understanding, and fairness.
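The difference between ID-based and content-driven scoring can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the video IDs, the three-dimensional feature vectors, and the mean-pooling user model are hypothetical and are not taken from the MicroLens code or data.

```python
import random

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def mean_pool(vectors):
    """Average a list of equal-length vectors element-wise."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

# Hypothetical content features (e.g. from a frozen text or image encoder).
video_features = {
    "v1": [1.0, 0.0, 0.0],
    "v2": [0.9, 0.1, 0.0],
    "v3": [0.0, 0.0, 1.0],
    "v_new": [0.8, 0.2, 0.0],  # newly uploaded, no interactions yet
}

# Hypothetical ID embeddings: only videos seen during training have a row.
rng = random.Random(0)
id_embeddings = {
    vid: [rng.gauss(0, 1) for _ in range(3)] for vid in ("v1", "v2", "v3")
}

def score_id(history, candidate):
    """ID-based score; returns None when the candidate has no embedding."""
    if candidate not in id_embeddings:
        return None  # cold-start item: nothing was learned for this ID
    user_vec = mean_pool([id_embeddings[v] for v in history])
    return dot(user_vec, id_embeddings[candidate])

def score_content(history, candidate):
    """Content-based score; works for any video that has features."""
    user_vec = mean_pool([video_features[v] for v in history])
    return dot(user_vec, video_features[candidate])

history = ["v1", "v2"]
print("ID-based score for v_new:", score_id(history, "v_new"))
print("Content score for v_new:", score_content(history, "v_new"))
print("Content score for v3:", score_content(history, "v3"))
```

In this toy setup the ID-based scorer cannot rank the unseen video at all, while the content-based scorer places it above an unrelated video because its features resemble the user's history.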
Quick Start & Requirements
Datasets: MicroLens-50k (https://recsys.westlake.edu.cn/MicroLens-50k-Dataset/), MicroLens-100k (https://recsys.westlake.edu.cn/MicroLens-100k-Dataset/). MicroLens-1M is available for WWW 2025 MIRC.
Code/: training/testing scripts include run_id.py, run_text.py, run_image.py, run_video.py.
Data Generation/generate_cover_frames_lmdb.py is provided.
quick_download.txt for video downloads; MMRec framework integration.
Highlighted Details
Maintenance & Community
The maintainers continue to develop and expand the dataset, releasing new versions and features. The work has drawn attention from Google DeepMind and YouTube, as evidenced by invited talks. The lab is hiring research personnel, which indicates ongoing research activity. No direct community channels are listed.
Licensing & Compatibility
No explicit open-source license is stated. A "Caution" notice prohibits private modification and secondary distribution of the dataset, and encourages users to open-source their processing code or notify the authors of any alterations. This suggests a restrictive usage policy that may rule out commercial or closed-source integration without explicit permission.
Limitations & Caveats
Dataset redistribution is restricted. Specific, older versions of Python, PyTorch, and CUDA are required. Preparation of LMDB files is necessary for certain model types, adding a setup step.
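Because the required versions are not listed here, a small script can at least report what is installed locally, for comparison against the repository's stated requirements. This is a minimal sketch: the `environment_report` helper is hypothetical and not part of the MicroLens code, and it assumes only that PyTorch, if present, is importable as `torch`.

```python
import importlib
import platform

def environment_report():
    """Collect the local Python, PyTorch, and CUDA versions (None if absent)."""
    report = {"python": platform.python_version(), "torch": None, "cuda": None}
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        return report  # PyTorch is not installed in this environment
    report["torch"] = torch.__version__
    report["cuda"] = torch.version.cuda  # None for CPU-only builds
    return report

if __name__ == "__main__":
    for name, version in environment_report().items():
        print(f"{name}: {version or 'not available'}")
```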