Open-Qwen2VL by Victorwz

Multimodal LLM pre-training and fine-tuning

Created 6 months ago
267 stars

Top 95.9% on SourcePulse

Project Summary

Open-Qwen2VL provides an open-source framework for training multimodal large language models (MLLMs) with a focus on computational efficiency and academic resource accessibility. It addresses the need for reproducible and efficient training of powerful vision-language models, benefiting researchers and developers working on multimodal AI.

How It Works

The project employs several key techniques for efficient multimodal dataset preparation and model training. It utilizes Data Filtering Network (DFN) and CLIP for data quality scoring and selection, followed by multimodal sequence packing into the webdataset format. This packing optimizes data loading and processing for large-scale training. The framework supports both pre-training and supervised fine-tuning (SFT) on various dataset sizes, including large-scale ones like MAmmoTH-VL-10M.
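To make the packing step concrete, here is a minimal sketch of multimodal sequence packing as a first-fit-decreasing bin-packing problem: variable-length (text + image-token) samples are grouped into fixed-length packed sequences to minimize padding waste. The function name, the 4096-token limit, and the packing policy are assumptions for illustration; Open-Qwen2VL's actual implementation may differ.

```python
# Hypothetical sketch of multimodal sequence packing (assumed policy:
# first-fit-decreasing). Each sample's token count includes its image tokens.
from typing import List

MAX_LEN = 4096  # assumed packed-sequence length


def pack_sequences(lengths: List[int], max_len: int = MAX_LEN) -> List[List[int]]:
    """Return groups of sample indices whose combined token counts
    each fit within max_len."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    bins: List[List[int]] = []   # sample indices per packed sequence
    space: List[int] = []        # remaining room per packed sequence
    for i in order:
        if lengths[i] > max_len:
            raise ValueError(f"sample {i} exceeds max_len")
        for b, room in enumerate(space):
            if lengths[i] <= room:
                bins[b].append(i)   # first existing bin with room
                space[b] -= lengths[i]
                break
        else:
            bins.append([i])        # open a new packed sequence
            space.append(max_len - lengths[i])
    return bins


# Example: token counts for 6 image-text samples
packed = pack_sequences([3000, 1200, 900, 2500, 700, 400], max_len=4096)
```

Packing sorted-descending keeps large samples from fragmenting bins; the six samples above fit into three packed sequences instead of six padded ones.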

Quick Start & Requirements

  • Installation: pip install -e prismatic-vlms
  • Prerequisites: Python 3.10; CUDA and flash-attn (required for pre-training and SFT).
  • Setup: Requires downloading pre-trained checkpoints and datasets.
  • Docs: https://github.com/Victorwz/Open-Qwen2VL (the repository README serves as the documentation).
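The steps above can be sketched as a setup script. The clone path and the flash-attn install flag are assumptions based on common practice; consult the repository README for the authoritative steps, including which checkpoints and datasets to download.

```shell
# Hypothetical setup sketch; exact checkpoint/dataset paths are not
# specified here -- see the repository README.
git clone https://github.com/Victorwz/Open-Qwen2VL.git
cd Open-Qwen2VL
pip install -e prismatic-vlms                 # install the training framework
pip install flash-attn --no-build-isolation   # requires a CUDA toolkit
```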

Highlighted Details

  • Supports data quality scoring, selection, and resharding into webdataset format.
  • Enables multimodal sequence packing for efficient large-scale dataset handling.
  • Offers pre-training and supervised fine-tuning capabilities.
  • Released pre-trained model checkpoints and pre-training data.
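As an illustration of the resharding step, the sketch below writes filtered samples into webdataset-style tar shards using only the standard library. In the webdataset convention, each sample is a run of consecutive tar members sharing a basename key (e.g. `000042.jpg` + `000042.txt`). The shard size, naming scheme, and helper function are assumptions; the project may use the `webdataset` library directly.

```python
# Hypothetical sketch of resharding samples into webdataset-style tar shards.
import io
import tarfile


def write_shards(samples, shard_size=2, prefix="shard"):
    """samples: iterable of (key, {ext: bytes}). Writes tar shards holding
    shard_size samples each and returns the shard filenames."""
    names = []
    tar = None
    count = 0
    for key, files in samples:
        if tar is None:
            name = f"{prefix}-{len(names):06d}.tar"
            tar = tarfile.open(name, "w")
            names.append(name)
        for ext, data in files.items():
            info = tarfile.TarInfo(f"{key}.{ext}")  # key.ext member name
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
        count += 1
        if count == shard_size:   # shard full: close and start a new one
            tar.close()
            tar, count = None, 0
    if tar is not None:
        tar.close()
    return names


samples = [
    ("000000", {"jpg": b"<jpeg bytes>", "txt": b"a photo of a cat"}),
    ("000001", {"jpg": b"<jpeg bytes>", "txt": b"a diagram"}),
    ("000002", {"jpg": b"<jpeg bytes>", "txt": b"a street scene"}),
]
shard_names = write_shards(samples, shard_size=2)
```

Fixed-size shards let a dataloader stream and shuffle at the shard level, which is what makes the format efficient for large-scale training.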

Maintenance & Community

The project is associated with a COLM 2025 publication. The codebase builds on prismatic-vlms and vlm-evaluation.

Licensing & Compatibility

The repository does not explicitly state a license. Although the project is released openly as an academic resource, which suggests a permissive or research-oriented intent, the absence of an explicit license means commercial use would require clarification from the maintainers.

Limitations & Caveats

The project is presented as a research artifact with a release date of March 31, 2025, implying it may be experimental or subject to further development. Specific hardware requirements for efficient pre-training (e.g., GPU memory, number of GPUs) are not detailed beyond the need for CUDA.

Health Check

  • Last Commit: 3 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 16 stars in the last 30 days

Explore Similar Projects

Starred by Shizhe Diao (author of LMFlow; Research Scientist at NVIDIA), Jiaming Song (Chief Scientist at Luma AI), and 1 more.

Curator by NVIDIA-NeMo (1.3%, 1k stars): Data curation toolkit for LLMs. Created 1 year ago; updated 1 day ago.