VLM codebase for training visually-conditioned language models
This repository provides a flexible and efficient codebase for training visually-conditioned language models (VLMs). It targets researchers and practitioners who want to experiment with or deploy VLMs, offering support for diverse visual backbones and language models, and straightforward scaling to multi-billion-parameter models.
How It Works
Prismatic VLMs supports multiple visual backbones (CLIP, SigLIP, DINOv2, and fusions of these) via TIMM integration, as well as arbitrary AutoModelForCausalLM instances from Hugging Face Transformers. Training leverages PyTorch FSDP and Flash-Attention to scale efficiently from 1B to 34B parameters.
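The sketch below illustrates the building blocks this design combines, not the repository's internal API: a TIMM vision backbone paired with a Hugging Face causal LM. The specific model IDs are examples.

```python
import timm
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Vision backbone via TIMM (e.g., a CLIP ViT); num_classes=0 returns features instead of logits.
vision_backbone = timm.create_model(
    "vit_large_patch14_clip_336.openai", pretrained=True, num_classes=0
)

# Any AutoModelForCausalLM-compatible language model from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
language_model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1", torch_dtype=torch.bfloat16
)

# A VLM then learns a projection from vision features into the LM's embedding space;
# large-scale training wraps these modules in PyTorch FSDP with Flash-Attention enabled.
```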
Quick Start & Requirements
Clone the repository, then install the package in editable mode:

pip install -e .
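After installation, pretrained models can be loaded for inference. The snippet below is a minimal sketch following the pattern described in the upstream README; the load helper, prompt builder, and generate arguments are assumptions that should be verified against the current codebase.

```python
import torch
from PIL import Image
from prismatic import load  # loading helper per the upstream README (verify against current API)

device = "cuda" if torch.cuda.is_available() else "cpu"

# Model ID is an example; gated base LMs (e.g., Llama-2) may require a Hugging Face access token.
vlm = load("prism-dinosiglip+7b")
vlm.to(device, dtype=torch.bfloat16)

# Build a prompt and generate a response for a local image.
image = Image.open("example.jpg").convert("RGB")
prompt_builder = vlm.get_prompt_builder()
prompt_builder.add_turn(role="human", message="What is going on in this image?")

generated_text = vlm.generate(image, prompt_builder.get_prompt(), max_new_tokens=128)
print(generated_text)
```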
Highlighted Details
- Visual backbones: CLIP, SigLIP, DINOv2, and fusions, integrated via TIMM
- Language models: arbitrary AutoModelForCausalLM instances from Hugging Face Transformers
- Scaling: PyTorch FSDP and Flash-Attention, from 1B to 34B parameters
- License: MIT (code); pretrained models inherit licenses from their base datasets and LMs
Maintenance & Community
The project is actively maintained by TRI-ML. Further community engagement details are not explicitly provided in the README.
Licensing & Compatibility
The code is released under the MIT License. Pretrained models inherit licenses from their base datasets and LMs (e.g., the Llama Community License for Llama-2-derived models, Apache/MIT for Mistral- and Phi-2-derived models). Commercial use is permitted for models whose underlying licenses allow it.
Limitations & Caveats
Pretrained models may have licensing restrictions inherited from their training data and base language models. Users must ensure compliance with these underlying licenses.