AviSoori1x
Build and understand vision-language models from scratch
This repository provides a from-scratch implementation of a Vision Language Model (VLM) in pure PyTorch, designed for educational purposes. It targets engineers and researchers seeking a deep understanding of VLM architectures, offering a consolidated, hackable codebase that simplifies complex concepts. The project aims to demystify VLM construction by presenting a clear, single-file implementation.
How It Works
The VLM comprises three core components: an image encoder (a from-scratch CLIP Vision Transformer), a vision-language projector (an MLP to align visual features with text embeddings, creating "visual tokens"), and a decoder-only language model. The decoder is an autoregressive, character-level model, drawing inspiration from Andrej Karpathy's makemore for its scaled dot-product self-attention mechanism. This approach allows for a unified, end-to-end implementation within a single PyTorch file, prioritizing clarity and ease of modification over raw performance.
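The data flow described above can be sketched in a few lines of dependency-free Python. This is a toy illustration, not the repository's code: every name here (`char_tokenize`, `num_patches`, `linear`, `causal_self_attention`) and every dimension is hypothetical, the MLP projector is reduced to a single linear map for brevity, and placing the visual tokens ahead of the text tokens in the decoder input is an assumption about the sequence layout.

```python
import math

def char_tokenize(text, vocab):
    """Character-level tokenization: one integer id per character."""
    return [vocab[ch] for ch in text]

def num_patches(img_size, patch_size):
    """Count the non-overlapping square patches a ViT-style encoder would see."""
    assert img_size % patch_size == 0, "image size must be divisible by patch size"
    return (img_size // patch_size) ** 2

def linear(x, weight):
    """Bias-free linear map of one vector; weight is laid out as [d_out][d_in]."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_self_attention(q, k, v):
    """Scaled dot-product attention; position t attends only to positions <= t."""
    d_k = len(q[0])
    out = []
    for t in range(len(q)):
        # Scores against the current and all earlier positions (the causal mask).
        scores = [
            sum(a * b for a, b in zip(q[t], k[s])) / math.sqrt(d_k)
            for s in range(t + 1)
        ]
        weights = softmax(scores)
        out.append([
            sum(w * v[s][j] for s, w in enumerate(weights))
            for j in range(len(v[0]))
        ])
    return out

# Stand-in for the image encoder output: 2 patch embeddings of width 3.
image_features = [[0.5, 1.0, 2.0], [1.0, 0.0, 3.0]]

# Stand-in for the projector: maps vision width 3 to text embedding width 2.
proj_w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
visual_tokens = [linear(f, proj_w) for f in image_features]

# Character-level vocabulary and a toy embedding table of width 2.
vocab = {ch: i for i, ch in enumerate(sorted(set("a cat")))}
embedding_table = [[0.1 * i, -0.1 * i] for i in range(len(vocab))]
text_tokens = [embedding_table[i] for i in char_tokenize("cat", vocab)]

# Assumed layout: visual tokens first, then text tokens, fed to the decoder.
sequence = visual_tokens + text_tokens
attended = causal_self_attention(sequence, sequence, sequence)
```

In a real model the queries, keys, and values would come from separate learned projections of `sequence`; reusing the same vectors for all three keeps the sketch short while still showing the causal masking and the 1/sqrt(d_k) scaling.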
Quick Start & Requirements
The project is demonstrated mainly through Jupyter notebooks (seeMoE_from_scratch.ipynb, seemore_Concise.ipynb) and a single implementation file (seemore.py). Development occurred on Databricks using an A100 GPU. The README does not give installation commands; the core dependencies are Python and PyTorch. MLflow is suggested for metric tracking but is optional. Relevant resources include:
https://huggingface.co/blog/AviSoori1x/seemoe
https://avisoori1x.github.io/2024/04/22/seemore-_Implement_a_Vision_Language_Model_from_Scratch.html
makemore repository: https://github.com/karpathy/makemore
Highlighted Details
[...] seemore.py) for maximum hackability and educational value.
[...] modules subdirectory.
Maintenance & Community
The repository is maintained by AviSoori1x. No specific community channels (e.g., Discord, Slack) or public roadmaps are mentioned in the provided README.
Licensing & Compatibility
The README does not explicitly state the project's license. Potential users should verify licensing terms before integration, especially for commercial applications.
Limitations & Caveats
The implementation prioritizes readability and hackability over performance optimization. The README describes it as a "simplistic" version of advanced VLMs. It was developed on Databricks with an A100 GPU, though the code is intended to run elsewhere.
1 year ago
Inactive