Vision pretraining model for scalable learning using visual sentences
LVM is a vision pretraining model that frames visual data as "visual sentences" for autoregressive next-token prediction, enabling scalable learning without linguistic data. It targets researchers and practitioners in computer vision and large model development, offering a novel approach to multimodal learning.
How It Works
LVM takes a sequential modeling approach: diverse visual data (images, videos, semantic segmentations, depth reconstructions) is converted into a unified "visual sentence" format, with each image first mapped to discrete visual tokens by a VQGAN tokenizer. The resulting token sequence is processed by an autoregressive model (built on OpenLLaMA) that predicts the next visual token, just as a language model predicts the next word. This formulation supports training on massive datasets (420 billion tokens), scales effectively across model sizes, and enables zero-shot task adaptation by designing suitable visual prompts at inference time.
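The sketch below illustrates the core idea in PyTorch rather than the repository's JAX/EasyLM training code: images are mapped to discrete tokens, the token grids are concatenated into one "visual sentence", and a causal transformer is trained with the same next-token cross-entropy objective as a language model. The `vq_tokenize` stub, codebook size, grid size, and the tiny model are illustrative assumptions, not LVM's actual components.

```python
# Illustrative sketch of the "visual sentence" objective (not the LVM/EasyLM code).
# Assumptions: `vq_tokenize` stands in for a pretrained VQ image tokenizer, and
# TinyCausalLM stands in for the LLaMA-style autoregressive backbone.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB_SIZE = 8192       # size of the visual codebook (assumed)
TOKENS_PER_IMAGE = 256  # e.g. a 16x16 grid of discrete codes per image (assumed)

def vq_tokenize(image: torch.Tensor) -> torch.Tensor:
    """Stand-in for a VQ encoder: map one image to a grid of discrete token ids."""
    return torch.randint(0, VOCAB_SIZE, (TOKENS_PER_IMAGE,))

def build_visual_sentence(images: list) -> torch.Tensor:
    """Concatenate per-image token grids so frames, segmentation maps, depth maps,
    etc. become one flat token sequence (a "visual sentence")."""
    return torch.cat([vq_tokenize(img) for img in images])

class TinyCausalLM(nn.Module):
    def __init__(self, d_model=256, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, VOCAB_SIZE)

    def forward(self, tokens):
        seq_len = tokens.size(1)
        # Causal mask: position t may only attend to positions <= t.
        causal = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        h = self.blocks(self.embed(tokens), mask=causal)
        return self.head(h)

# A "visual sentence" built from four frames; the training signal is simply
# next-token prediction over the concatenated visual tokens.
images = [torch.rand(3, 256, 256) for _ in range(4)]
sentence = build_visual_sentence(images).unsqueeze(0)      # shape (1, 4 * 256)
model = TinyCausalLM()
logits = model(sentence[:, :-1])                           # predict token t+1 from tokens <= t
loss = F.cross_entropy(logits.reshape(-1, VOCAB_SIZE), sentence[:, 1:].reshape(-1))
loss.backward()
```

At inference time the same interface supports visual prompting: a few example input/output images are tokenized as the prefix, and the model continues the sequence with the tokens of the desired output.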
Quick Start & Requirements
Clone the repository:
git clone https://github.com/ytongbai/LVM
Then create the conda environment:
conda env create -f scripts/gpu_environment.yml
See DATASET.md for dataset preparation and EasyLM for distributed training details. Trained checkpoints can be converted to Hugging Face format with:
python -m EasyLM.models.llama.convert_easylm_to_hf
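If the conversion step produces a standard Hugging Face LLaMA checkpoint (which the module name suggests, but is an assumption here), it can be loaded with the transformers library; the output directory ./lvm-7b-hf and the random prompt tokens below are hypothetical placeholders.

```python
# Sketch of loading a converted checkpoint with Hugging Face Transformers.
# Assumptions: ./lvm-7b-hf is a hypothetical output path from the conversion step,
# and the prompt is random ids standing in for the VQ tokens of real prompt images.
import torch
from transformers import LlamaForCausalLM

model = LlamaForCausalLM.from_pretrained("./lvm-7b-hf")
model.eval()

prompt_tokens = torch.randint(0, model.config.vocab_size, (1, 256))
with torch.no_grad():
    out = model.generate(prompt_tokens, max_new_tokens=256, do_sample=False)
# `out` holds the prompt followed by predicted visual tokens; a VQ decoder is
# needed to turn those tokens back into pixels.
print(out.shape)
```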
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The README notes that the 7B model has been released and that additional sizes will be made available later, so checkpoint availability is currently limited. The project is presented as a pretraining model and requires further fine-tuning (or suitable visual prompts) for specific downstream tasks.