seemore by AviSoori1x

Build and understand vision-language models from scratch

Created 1 year ago
251 stars

Top 99.9% on SourcePulse

Project Summary

This repository provides a from-scratch implementation of a vision-language model (VLM) in pure PyTorch, written for educational purposes. It targets engineers and researchers who want to understand VLM architectures in depth, and it keeps the whole model in a single, hackable file so that each component can be read and modified in one place.

How It Works

The VLM comprises three core components: an image encoder (a from-scratch implementation of the Vision Transformer used in CLIP), a vision-language projector (an MLP that maps visual features into the text embedding space, producing "visual tokens"), and a decoder-only language model. The decoder is an autoregressive, character-level model whose scaled dot-product self-attention draws on Andrej Karpathy's makemore. All three components are implemented end to end in a single PyTorch file, which favors clarity and ease of modification over raw performance.
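The three-stage data flow described above can be illustrated with a minimal sketch. This is not the code in seemore.py: the class names (Block, TinyVLM), all sizes, the single-head attention, and the choice to pass every patch to the decoder as a visual token are assumptions made to keep the example short.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Block(nn.Module):
    """Pre-norm transformer block with single-head scaled dot-product attention."""

    def __init__(self, dim, causal):
        super().__init__()
        self.causal = causal
        self.ln1, self.ln2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.proj = nn.Linear(dim, dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):  # x: (B, T, dim)
        q, k, v = self.qkv(self.ln1(x)).chunk(3, dim=-1)
        scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)  # (B, T, T)
        if self.causal:
            # Each position may attend only to itself and earlier positions.
            T = x.size(1)
            mask = torch.tril(torch.ones(T, T, dtype=torch.bool, device=x.device))
            scores = scores.masked_fill(~mask, float("-inf"))
        x = x + self.proj(F.softmax(scores, dim=-1) @ v)
        return x + self.mlp(self.ln2(x))


class TinyVLM(nn.Module):
    """Hypothetical toy model: image encoder -> projector -> character-level decoder."""

    def __init__(self, vocab_size=65, dim=64, img_size=32, patch=8, max_len=64):
        super().__init__()
        self.num_patches = (img_size // patch) ** 2
        # 1) Image encoder: a strided convolution embeds non-overlapping patches.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.img_pos = nn.Parameter(torch.zeros(1, self.num_patches, dim))
        self.encoder = Block(dim, causal=False)
        # 2) Projector: an MLP that maps visual features into the text embedding space.
        self.projector = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # 3) Decoder-only, character-level language model.
        self.tok_embed = nn.Embedding(vocab_size, dim)
        self.pos_embed = nn.Embedding(self.num_patches + max_len, dim)
        self.decoder = Block(dim, causal=True)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, images, char_ids):
        patches = self.patch_embed(images).flatten(2).transpose(1, 2)  # (B, P, dim)
        visual_tokens = self.projector(self.encoder(patches + self.img_pos))
        # Visual tokens are prepended to the character embeddings.
        seq = torch.cat([visual_tokens, self.tok_embed(char_ids)], dim=1)  # (B, P + T, dim)
        pos = torch.arange(seq.size(1), device=seq.device)
        hidden = self.decoder(seq + self.pos_embed(pos))
        return self.head(hidden[:, self.num_patches:])  # logits for text positions only


model = TinyVLM()
images = torch.randn(2, 3, 32, 32)          # two random 32x32 RGB "images"
char_ids = torch.randint(0, 65, (2, 10))    # two sequences of 10 character ids
logits = model(images, char_ids)
print(logits.shape)  # torch.Size([2, 10, 65])
```

Because the visual tokens sit in front of the text, the causal mask lets every character position attend to the whole image while still hiding later characters, which is what makes next-character prediction conditioned on the image possible in this arrangement.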

Quick Start & Requirements

The project is demonstrated primarily through two Jupyter notebooks (seeMoE_from_scratch.ipynb, seemore_Concise.ipynb) and a single implementation file (seemore.py). Development took place on Databricks using an A100 GPU. Specific installation commands are not detailed; the core dependencies are Python and PyTorch, and MLflow is suggested for metric tracking but is optional. Relevant resources include:

  • Hugging Face Blog: https://huggingface.co/blog/AviSoori1x/seemoe
  • Personal Blog: https://avisoori1x.github.io/2024/04/22/seemore-_Implement_a_Vision_Language_Model_from_Scratch.html
  • makemore repository: https://github.com/karpathy/makemore
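Since no installation commands are given, a quick environment check can help before opening the notebooks. The sketch below is an assumption about how one might do this and is not part of the repository; the helper names has_module and describe_environment are hypothetical. It only reports whether PyTorch and the optional MLflow package are importable and which device PyTorch would use.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if `name` can be imported in the current environment."""
    return importlib.util.find_spec(name) is not None


def describe_environment() -> dict:
    """Report which dependencies are present and which device would be used."""
    report = {
        "torch": has_module("torch"),    # core dependency
        "mlflow": has_module("mlflow"),  # optional, used only for metric tracking
        "device": None,
    }
    if report["torch"]:
        import torch

        # Development used an A100 GPU; fall back to CPU when no GPU is visible.
        report["device"] = "cuda" if torch.cuda.is_available() else "cpu"
    return report


print(describe_environment())
```

If "torch" is reported as missing, installing PyTorch by following its official instructions for the target platform is the usual first step; a missing "mlflow" only affects metric tracking.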

Highlighted Details

  • Full "from scratch" implementation of a Vision Language Model in pure PyTorch.
  • Consolidated into a single file (seemore.py) for maximum hackability and educational value.
  • Presents a simplified architecture in the spirit of Grok 1.5 Vision and GPT-4 Vision.
  • Individual components are also available as separate modules in the modules subdirectory.

Maintenance & Community

The repository is maintained by AviSoori1x. No specific community channels (e.g., Discord, Slack) or public roadmaps are mentioned in the provided README.

Licensing & Compatibility

The README does not explicitly state the project's license. Potential users should verify licensing terms before integration, especially for commercial applications.

Limitations & Caveats

The implementation prioritizes readability and hackability over performance optimization, and it is described as a "simplistic" version of advanced VLMs. Development took place in a specific environment (Databricks, A100 GPU), although the code is intended to be runnable elsewhere.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 5 stars in the last 30 days
