minimind-v by jingyaogong

Framework for training vision-language models (VLMs) from scratch

created 10 months ago
4,270 stars

Top 11.7% on sourcepulse

Project Summary

MiniMind-V is an open-source project that lets users train a 26M-parameter vision-language model (VLM) from scratch in roughly one hour at minimal cost. It targets individuals who want to understand and experiment with VLM development, offering a simplified architecture, dataset preparation, and training pipeline that fit on a personal GPU.

How It Works

MiniMind-V integrates a visual encoder (CLIP ViT-base/patch16) with a language model (the MiniMind LLM). The encoder converts each image into 196 visual tokens, which a single linear projection maps into the LLM's text-token embedding space. Visual input is thereby treated as a special "foreign language" that the LLM learns to interpret, enabling efficient VLM training and inference on consumer hardware.
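
As a rough sketch of that pipeline (not the project's own code), the snippet below loads the CLIP ViT-B/16 vision encoder from the Hugging Face Hub, takes its 196 patch tokens, and maps them into the LLM embedding space with one linear layer; the Hub model ID and the 512-dimensional hidden size are illustrative assumptions.

    # Illustrative sketch only; MiniMind-V's actual module names and dimensions may differ.
    import torch
    import torch.nn as nn
    from PIL import Image
    from transformers import CLIPImageProcessor, CLIPVisionModel

    vision = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch16")
    processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch16")

    # A 224x224 image cut into 16x16 patches yields (224/16)^2 = 196 patch tokens.
    image = Image.new("RGB", (224, 224))  # placeholder image
    pixel_values = processor(images=image, return_tensors="pt")["pixel_values"]

    with torch.no_grad():
        features = vision(pixel_values=pixel_values).last_hidden_state  # (1, 197, 768)
    patch_tokens = features[:, 1:, :]  # drop the [CLS] token -> (1, 196, 768)

    # One linear layer aligns the visual tokens with the LLM's text-token embedding space.
    llm_hidden = 512  # assumed MiniMind hidden size
    projector = nn.Linear(patch_tokens.shape[-1], llm_hidden)
    visual_embeds = projector(patch_tokens)  # (1, 196, 512), ready to splice into the text sequence
    print(visual_embeds.shape)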

Quick Start & Requirements

  • Install: pip install -r requirements.txt
  • Prerequisites: Python 3.10.16, CUDA 12.2, PyTorch with CUDA support. Requires downloading the CLIP and base language-model weights (a quick sanity check is sketched after this list).
  • Setup: Clone repository, download models, install dependencies. Estimated setup time: ~30-60 minutes.
  • Links: Online Demo, Video Introduction
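
A minimal post-setup sanity check, assuming PyTorch and the transformers library are installed and the CLIP weights can be fetched from the Hugging Face Hub; the repository's own scripts and local weight paths may differ.

    # Environment sanity check -- a sketch, not a documented project script.
    import torch
    from transformers import CLIPVisionModel

    print("torch:", torch.__version__)
    print("CUDA available:", torch.cuda.is_available())  # expected True for GPU training

    # Verify the CLIP ViT-base/patch16 weights named in the prerequisites can be loaded.
    vision = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch16")
    n_params = sum(p.numel() for p in vision.parameters())
    print(f"CLIP vision encoder loaded: {n_params / 1e6:.1f}M parameters")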

Highlighted Details

  • Trains a 26M-parameter VLM in about one hour on a single NVIDIA 3090.
  • Offers pre-trained and SFT models with sizes ranging from 26M to 109M parameters.
  • Supports single and multi-image inputs.
  • Provides a simplified VLM structure with minimal code changes from its base LLM.

Maintenance & Community

  • Active development with recent updates in April 2025.
  • Community contributions are welcomed via Pull Requests and Issues.
  • Inspired by and references LLaVA and Chinese-LLaVA projects.

Licensing & Compatibility

  • Licensed under Apache-2.0.
  • Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

The projection layer's cross-modal alignment may be less performant than cross-attention methods. The CLIP model uses a 224x224 resolution (or 128x128 in the dataset), which might limit fine-grained image feature representation. Multi-image fine-tuning is noted as having limited effectiveness due to dataset size and English-only content.
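
To make the first caveat concrete, the sketch below contrasts the token-wise linear projection with a cross-attention connector in which a small set of learned queries pools information across all visual tokens (in the spirit of query-based resamplers used by some other VLMs); the dimensions and head count are illustrative assumptions, and neither block is code from MiniMind-V.

    # Illustrative comparison of the two alignment strategies mentioned above.
    import torch
    import torch.nn as nn

    vis_dim, llm_dim, n_queries = 768, 512, 32      # assumed sizes for the sketch
    visual_tokens = torch.randn(1, 196, vis_dim)    # stand-in for CLIP patch features

    # (a) Linear projection (MiniMind-V's approach): each token is mapped independently.
    linear_proj = nn.Linear(vis_dim, llm_dim)
    aligned = linear_proj(visual_tokens)            # (1, 196, llm_dim)

    # (b) Cross-attention connector: learned queries attend over all visual tokens,
    #     mixing information across patches before it reaches the LLM.
    queries = nn.Parameter(torch.randn(1, n_queries, llm_dim))
    to_kv = nn.Linear(vis_dim, llm_dim)             # project keys/values to llm_dim
    cross_attn = nn.MultiheadAttention(embed_dim=llm_dim, num_heads=8, batch_first=True)
    kv = to_kv(visual_tokens)
    pooled, _ = cross_attn(queries, kv, kv)         # (1, n_queries, llm_dim)

    print(aligned.shape, pooled.shape)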

Health Check

  • Last commit: 3 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 5
  • Star History: 1,012 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering, Designing Machine Learning Systems), Omar Sanseviero (DevRel at Google DeepMind), and 4 more.

open_flamingo by mlfoundations
Open-source framework for training large multimodal models
Top 0.1% · 4k stars
created 2 years ago · updated 11 months ago