MiniMind-V: train a vision-language model (VLM) from scratch
MiniMind-V is an open-source project that enables users to train a 26M-parameter vision-language model (VLM) from scratch in approximately one hour at minimal cost. It targets individuals who want to understand and experiment with VLM development, offering a simplified architecture, dataset preparation, and training pipeline that run on a single personal GPU.
How It Works
MiniMind-V integrates a visual encoder (CLIP ViT-base/patch16) with a language model (MiniMind LLM). Images are processed by the visual encoder into 196 visual tokens, which are then projected and aligned with the LLM's text token space using a linear transformation. This approach treats visual information as a special "foreign language" that the LLM learns to interpret, allowing for efficient VLM training and inference on consumer hardware.
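The core idea can be sketched in a few lines. The snippet below is illustrative only, assuming the Hugging Face transformers CLIP API; the projection layer, the assumed LLM hidden size, and the variable names are not taken from the project's actual code.

```python
import torch
import torch.nn as nn
from PIL import Image
from transformers import CLIPVisionModel, CLIPImageProcessor

# CLIP ViT-base/patch16: a 224x224 image split into 16x16 patches
# yields 14*14 = 196 patch tokens (plus one CLS token).
vision = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch16")
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch16")

image = Image.new("RGB", (224, 224))  # placeholder image
pixels = processor(images=image, return_tensors="pt").pixel_values
patch_tokens = vision(pixel_values=pixels).last_hidden_state[:, 1:, :]  # (1, 196, 768)

# A single linear layer maps the 196 visual tokens into the LLM's embedding
# space, so the LLM can treat them like ordinary ("foreign-language") tokens.
llm_hidden = 512  # assumed MiniMind hidden size, for illustration
project = nn.Linear(vision.config.hidden_size, llm_hidden)
visual_embeds = project(patch_tokens)  # (1, 196, llm_hidden)

# Conceptually, these are concatenated with the text embeddings and the
# combined sequence is fed to the LLM as usual:
# inputs = torch.cat([visual_embeds, text_embeds], dim=1)
```

Because alignment happens entirely in the embedding space, no changes to the LLM's attention layers are needed; the language model simply sees a longer token sequence.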
Quick Start & Requirements
pip install -r requirements.txt
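The command above assumes the repository is already cloned. A fuller starting point, assuming the project's standard GitHub location, might look like the following; dataset downloads and the exact training scripts are documented in the repository's README.

```bash
# Clone the repository and install dependencies; a Python 3 environment
# with a CUDA-capable GPU is assumed.
git clone https://github.com/jingyaogong/minimind-v
cd minimind-v
pip install -r requirements.txt
# Follow the README for dataset/base-weight downloads and the
# pretrain/SFT training steps.
```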
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
Cross-modal alignment through a single projection layer may underperform cross-attention-based approaches. The CLIP encoder operates at 224x224 resolution (the dataset uses 128x128 images), which can limit fine-grained image detail. Multi-image fine-tuning is noted as having limited effectiveness because the relevant dataset is small and English-only.