seemore by AviSoori1x

Build and understand vision-language models from scratch

Created 1 year ago

255 stars

Top 98.8% on SourcePulse

Project Summary

This repository provides a from-scratch implementation of a Vision Language Model (VLM) in pure PyTorch, written for educational purposes. It targets engineers and researchers who want to understand VLM architectures in depth, and it keeps the whole model in a single, hackable file so that each component can be read and modified in one place.

How It Works

The VLM comprises three core components: an image encoder (a from-scratch CLIP Vision Transformer), a vision-language projector (an MLP to align visual features with text embeddings, creating "visual tokens"), and a decoder-only language model. The decoder is an autoregressive, character-level model, drawing inspiration from Andrej Karpathy's makemore for its scaled dot-product self-attention mechanism. This approach allows for a unified, end-to-end implementation within a single PyTorch file, prioritizing clarity and ease of modification over raw performance.
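The data flow described above can be traced at the level of shapes. The following is a minimal sketch in plain Python rather than PyTorch, so that it runs without dependencies; it is not the code in seemore.py. All sizes, the toy string, and the helper names (rand_matrix, matmul, causal_self_attention) are hypothetical. Random features stand in for the image encoder, a single linear map stands in for the MLP projector, and the choices of one visual token per patch and of a causal mask over the whole sequence are assumptions made for illustration.

```python
import math
import random

rng = random.Random(0)

def rand_matrix(rows, cols):
    """Random weights standing in for trained parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def matmul(x, w):
    """(n x d_in) times (d_in x d_out) -> (n x d_out)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]

def causal_self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product attention with a causal mask."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = 1.0 / math.sqrt(len(q[0]))
    out, weights = [], []
    for t, q_t in enumerate(q):
        # Position t may only attend to positions 0..t (the causal mask).
        scores = [scale * sum(a * b for a, b in zip(q_t, k[s])) for s in range(t + 1)]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(score - peak) for score in scores]
        total = sum(exps)
        probs = [e / total for e in exps]
        weights.append(probs)
        out.append([sum(p * v[s][j] for s, p in enumerate(probs))
                    for j in range(len(v[0]))])
    return out, weights

# Hypothetical sizes, chosen only to keep the example small.
img_size, patch_size = 32, 8
vision_dim, text_dim = 16, 12

# Character-level vocabulary built from a toy string.
text = "a cat"
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
char_ids = [stoi[ch] for ch in text]

# 1) Image encoder stand-in: one feature vector per image patch (assumption).
num_patches = (img_size // patch_size) ** 2
patch_features = rand_matrix(num_patches, vision_dim)

# 2) Projector stand-in: map vision features to the text embedding width.
visual_tokens = matmul(patch_features, rand_matrix(vision_dim, text_dim))

# 3) Decoder input: visual tokens followed by character embeddings.
embedding_table = rand_matrix(len(chars), text_dim)
text_tokens = [embedding_table[i] for i in char_ids]
decoder_input = visual_tokens + text_tokens

# One attention step over the combined sequence.
attended, attn_weights = causal_self_attention(
    decoder_input,
    rand_matrix(text_dim, text_dim),
    rand_matrix(text_dim, text_dim),
    rand_matrix(text_dim, text_dim),
)
```

The point of the projector is visible in step 3: once the visual features have the same width as the character embeddings, the two lists can simply be concatenated and the decoder treats them as one sequence.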

Quick Start & Requirements

The project is demonstrated mainly through Jupyter notebooks (seeMoE_from_scratch.ipynb, seemore_Concise.ipynb) and a single implementation file (seemore.py). Development took place on Databricks using an A100 GPU. The README gives no installation commands; the code requires Python and PyTorch. MLflow is suggested for metric tracking but is optional. Relevant resources include:

Hugging Face Blog: https://huggingface.co/blog/AviSoori1x/seemoe
Personal Blog: https://avisoori1x.github.io/2024/04/22/seemore-_Implement_a_Vision_Language_Model_from_Scratch.html
makemore repository: https://github.com/karpathy/makemore
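Because metric tracking is optional, one way to keep a training loop independent of it is a thin logging wrapper. The sketch below is an assumption about how this could be arranged, not code from the repository; log_metric, history, and tracker are hypothetical names, and the loss values are made up.

```python
def log_metric(history, name, value, step, tracker=None):
    """Store a metric in a local dict and forward it to a tracker if one is given."""
    history.setdefault(name, []).append((step, value))
    if tracker is not None:
        # Any object with a log_metric(key, value, step=...) method works here.
        tracker.log_metric(name, value, step=step)

history = {}
for step, loss in enumerate([2.9, 2.4, 2.1]):  # made-up loss values
    log_metric(history, "train_loss", loss, step)
```

If MLflow is installed, passing the imported mlflow module as tracker forwards each value through mlflow.log_metric; without it, the values are still kept in history.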

Highlighted Details

Full "from scratch" implementation of a Vision Language Model in pure PyTorch.
Consolidated into a single file (seemore.py) for maximum hackability and educational value.
Presents a simplified architecture similar to that of Grok 1.5 Vision and GPT-4 Vision.
Individual components are also available separately in the modules subdirectory.

Maintenance & Community

The repository is maintained by AviSoori1x. No specific community channels (e.g., Discord, Slack) or public roadmaps are mentioned in the provided README.

Licensing & Compatibility

The README does not explicitly state the project's license. Potential users should verify licensing terms before integration, especially for commercial applications.

Limitations & Caveats

The implementation prioritizes readability and hackability over performance optimization, and it is described as a "simplistic" version of advanced VLMs. It was developed in a specific environment (Databricks with an A100 GPU), although the code is intended to run elsewhere.

