Multimodal LLM pre-training and fine-tuning
Open-Qwen2VL provides an open-source framework for training multimodal large language models (MLLMs) with a focus on computational efficiency and academic resource accessibility. It addresses the need for reproducible and efficient training of powerful vision-language models, benefiting researchers and developers working on multimodal AI.
How It Works
The project employs several key techniques for efficient multimodal dataset preparation and model training. It uses a Data Filtering Network (DFN) and CLIP for data-quality scoring and selection, followed by multimodal sequence packing into the webdataset format. This packing optimizes data loading and processing for large-scale training. The framework supports both pre-training and supervised fine-tuning (SFT) on datasets of varying sizes, including large-scale ones such as MAmmoTH-VL-10M.
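As a rough illustration of that curation step, the sketch below scores image-text pairs with CLIP, keeps only high-similarity pairs, and writes the survivors into webdataset shards. The checkpoint name, similarity threshold, and sample field names are assumptions made for the example, and shard writing here stands in for the framework's token-level sequence packing; this is not the exact Open-Qwen2VL or DFN pipeline.

```python
# Illustrative sketch only: score image-text pairs with CLIP, keep
# high-similarity pairs, and pack them into webdataset (.tar) shards.
# Checkpoint, threshold, and field names are assumptions, not the
# actual Open-Qwen2VL configuration.
import io

import torch
import webdataset as wds
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

MODEL_ID = "openai/clip-vit-base-patch32"  # assumed scoring model
model = CLIPModel.from_pretrained(MODEL_ID).eval()
processor = CLIPProcessor.from_pretrained(MODEL_ID)


def clip_score(image: Image.Image, caption: str) -> float:
    """Cosine similarity between the CLIP image and text embeddings."""
    inputs = processor(text=[caption], images=image,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img * txt).sum())


def pack_filtered(samples, pattern="filtered-%06d.tar",
                  threshold=0.28, max_per_shard=10_000):
    """Keep samples above the quality threshold and write them to shards.

    `samples` yields (key, jpeg_bytes, caption) tuples.
    """
    with wds.ShardWriter(pattern, maxcount=max_per_shard) as sink:
        for key, image_bytes, caption in samples:
            image = Image.open(io.BytesIO(image_bytes)).convert("RGB")
            if clip_score(image, caption) >= threshold:  # assumed cutoff
                sink.write({"__key__": key, "jpg": image_bytes, "txt": caption})
```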
Quick Start & Requirements
Install the prismatic-vlms codebase in editable mode (pip install -e prismatic-vlms) and install flash-attn, which is required for pre-training/SFT.
Maintenance & Community
The project is associated with a COLM 2025 publication. The codebase is developed on top of prismatic-vlms and vlm-evaluation.
Licensing & Compatibility
The repository does not explicitly state a license. However, the nature of open-source releases and the mention of academic resources suggest a permissive or research-oriented license. Compatibility for commercial use would require explicit clarification.
Limitations & Caveats
The project is presented as a research artifact with a release date of March 31, 2025, implying it may be experimental or subject to further development. Specific hardware requirements for efficient pre-training (e.g., GPU memory, number of GPUs) are not detailed beyond the need for CUDA.