minimind-v by jingyaogong

Framework for training vision-language models (VLMs) from scratch

created 10 months ago
4,270 stars

Top 11.7% on sourcepulse

Project Summary

MiniMind-V is an open-source project that lets users train a 26M-parameter vision-language model (VLM) from scratch in roughly one hour at minimal cost. It targets individuals who want to understand and experiment with VLM development, offering a simplified architecture, dataset preparation, and training pipeline that fit on a personal GPU.

How It Works

MiniMind-V integrates a visual encoder (CLIP ViT-base/patch16) with a language model (the MiniMind LLM). The encoder converts each image into 196 visual tokens, which a single linear projection maps into the LLM's text-token embedding space. Visual input is thereby treated as a special "foreign language" that the LLM learns to interpret, enabling efficient VLM training and inference on consumer hardware.
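
As a rough sketch of that pipeline (not the project's own code), the snippet below loads the CLIP ViT-B/16 vision encoder from the Hugging Face Hub, takes its 196 patch tokens, and maps them into the LLM embedding space with one linear layer; the Hub model ID and the 512-dimensional hidden size are illustrative assumptions.

    # Illustrative sketch only; MiniMind-V's actual module names and dimensions may differ.
    import torch
    import torch.nn as nn
    from PIL import Image
    from transformers import CLIPImageProcessor, CLIPVisionModel

    vision = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch16")
    processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch16")

    # A 224x224 image cut into 16x16 patches yields (224/16)^2 = 196 patch tokens.
    image = Image.new("RGB", (224, 224))  # placeholder image
    pixel_values = processor(images=image, return_tensors="pt")["pixel_values"]

    with torch.no_grad():
        features = vision(pixel_values=pixel_values).last_hidden_state  # (1, 197, 768)
    patch_tokens = features[:, 1:, :]  # drop the [CLS] token -> (1, 196, 768)

    # One linear layer aligns the visual tokens with the LLM's text-token embedding space.
    llm_hidden = 512  # assumed MiniMind hidden size
    projector = nn.Linear(patch_tokens.shape[-1], llm_hidden)
    visual_embeds = projector(patch_tokens)  # (1, 196, 512), ready to splice into the text sequence
    print(visual_embeds.shape)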

Quick Start & Requirements

  • Install: pip install -r requirements.txt
  • Prerequisites: Python 3.10.16, CUDA 12.2, PyTorch with CUDA support. Requires downloading the CLIP and base language-model weights (a quick sanity check is sketched after this list).
  • Setup: Clone repository, download models, install dependencies. Estimated setup time: ~30-60 minutes.
  • Links: Online Demo, Video Introduction
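
A minimal post-setup sanity check, assuming PyTorch and the transformers library are installed and the CLIP weights can be fetched from the Hugging Face Hub; the repository's own scripts and local weight paths may differ.

    # Environment sanity check -- a sketch, not a documented project script.
    import torch
    from transformers import CLIPVisionModel

    print("torch:", torch.__version__)
    print("CUDA available:", torch.cuda.is_available())  # expected True for GPU training

    # Verify the CLIP ViT-base/patch16 weights named in the prerequisites can be loaded.
    vision = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch16")
    n_params = sum(p.numel() for p in vision.parameters())
    print(f"CLIP vision encoder loaded: {n_params / 1e6:.1f}M parameters")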

Highlighted Details

  • Trains a 26M-parameter VLM in about one hour on a single NVIDIA 3090.
  • Offers pre-trained and SFT models with sizes ranging from 26M to 109M parameters.
  • Supports single and multi-image inputs.
  • Provides a simplified VLM structure with minimal code changes from its base LLM.

Maintenance & Community

  • Active development with recent updates in April 2025.
  • Community contributions are welcomed via Pull Requests and Issues.
  • Inspired by and references LLaVA and Chinese-LLaVA projects.

Licensing & Compatibility

  • Licensed under Apache-2.0.
  • Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

The projection layer's cross-modal alignment may be less performant than cross-attention methods. The CLIP model uses a 224x224 resolution (or 128x128 in the dataset), which might limit fine-grained image feature representation. Multi-image fine-tuning is noted as having limited effectiveness due to dataset size and English-only content.
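
To make the first caveat concrete, the sketch below contrasts the token-wise linear projection with a cross-attention connector in which a small set of learned queries pools information across all visual tokens (in the spirit of query-based resamplers used by some other VLMs); the dimensions and head count are illustrative assumptions, and neither block is code from MiniMind-V.

    # Illustrative comparison of the two alignment strategies mentioned above.
    import torch
    import torch.nn as nn

    vis_dim, llm_dim, n_queries = 768, 512, 32      # assumed sizes for the sketch
    visual_tokens = torch.randn(1, 196, vis_dim)    # stand-in for CLIP patch features

    # (a) Linear projection (MiniMind-V's approach): each token is mapped independently.
    linear_proj = nn.Linear(vis_dim, llm_dim)
    aligned = linear_proj(visual_tokens)            # (1, 196, llm_dim)

    # (b) Cross-attention connector: learned queries attend over all visual tokens,
    #     mixing information across patches before it reaches the LLM.
    queries = nn.Parameter(torch.randn(1, n_queries, llm_dim))
    to_kv = nn.Linear(vis_dim, llm_dim)             # project keys/values to llm_dim
    cross_attn = nn.MultiheadAttention(embed_dim=llm_dim, num_heads=8, batch_first=True)
    kv = to_kv(visual_tokens)
    pooled, _ = cross_attn(queries, kv, kv)         # (1, n_queries, llm_dim)

    print(aligned.shape, pooled.shape)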

Health Check

  • Last commit: 3 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 5
  • Star History: 1,012 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering, Designing Machine Learning Systems), Omar Sanseviero (DevRel at Google DeepMind), and 4 more.

open_flamingo by mlfoundations
Open-source framework for training large multimodal models
Top 0.1% · 4k stars
created 2 years ago · updated 11 months ago