Multimodal LLM pre-training and fine-tuning
Open-Qwen2VL provides an open-source framework for training multimodal large language models (MLLMs) with a focus on computational efficiency and academic resource accessibility. It addresses the need for reproducible and efficient training of powerful vision-language models, benefiting researchers and developers working on multimodal AI.
How It Works
The project employs several key techniques for efficient multimodal dataset preparation and model training. It uses a Data Filtering Network (DFN) and CLIP for data-quality scoring and selection, followed by multimodal sequence packing into the webdataset format. This packing optimizes data loading and processing for large-scale training. The framework supports both pre-training and supervised fine-tuning (SFT) on datasets of varying sizes, including large-scale ones such as MAmmoTH-VL-10M.
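As a rough illustration of that curation step, the sketch below scores image-text pairs with CLIP, keeps only high-similarity pairs, and writes the survivors into webdataset shards. The checkpoint name, similarity threshold, and sample field names are assumptions made for the example, and shard writing here stands in for the framework's token-level sequence packing; this is not the exact Open-Qwen2VL or DFN pipeline.

```python
# Illustrative sketch only: score image-text pairs with CLIP, keep
# high-similarity pairs, and pack them into webdataset (.tar) shards.
# Checkpoint, threshold, and field names are assumptions, not the
# actual Open-Qwen2VL configuration.
import io

import torch
import webdataset as wds
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

MODEL_ID = "openai/clip-vit-base-patch32"  # assumed scoring model
model = CLIPModel.from_pretrained(MODEL_ID).eval()
processor = CLIPProcessor.from_pretrained(MODEL_ID)


def clip_score(image: Image.Image, caption: str) -> float:
    """Cosine similarity between the CLIP image and text embeddings."""
    inputs = processor(text=[caption], images=image,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img * txt).sum())


def pack_filtered(samples, pattern="filtered-%06d.tar",
                  threshold=0.28, max_per_shard=10_000):
    """Keep samples above the quality threshold and write them to shards.

    `samples` yields (key, jpeg_bytes, caption) tuples.
    """
    with wds.ShardWriter(pattern, maxcount=max_per_shard) as sink:
        for key, image_bytes, caption in samples:
            image = Image.open(io.BytesIO(image_bytes)).convert("RGB")
            if clip_score(image, caption) >= threshold:  # assumed cutoff
                sink.write({"__key__": key, "jpg": image_bytes, "txt": caption})
```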
Quick Start & Requirements
Install the prismatic-vlms codebase in editable mode (pip install -e prismatic-vlms) and install flash-attn, which is required for pre-training/SFT.
Maintenance & Community
The project is associated with a COLM 2025 publication. The codebase is developed on top of prismatic-vlms and vlm-evaluation.
Licensing & Compatibility
The repository does not explicitly state a license. However, the nature of open-source releases and the mention of academic resources suggest a permissive or research-oriented license. Compatibility for commercial use would require explicit clarification.
Limitations & Caveats
The project is presented as a research artifact with a release date of March 31, 2025, implying it may be experimental or subject to further development. Specific hardware requirements for efficient pre-training (e.g., GPU memory, number of GPUs) are not detailed beyond the need for CUDA.