LVM by ytongbai

Vision pretraining model for scalable learning using visual sentences

created 2 years ago
1,831 stars

Top 24.1% on sourcepulse

Project Summary

LVM is a vision pretraining model that frames visual data as "visual sentences" for autoregressive next-token prediction, enabling scalable learning without any linguistic data. It targets researchers and practitioners in computer vision and large model development, offering a language-free approach to large-scale visual pretraining.

How It Works

LVM takes a sequential modeling approach, converting diverse visual data (images, videos, semantic segmentation maps, depth reconstructions) into a unified "visual sentence" format. Each sequence is then processed by an autoregressive transformer (based on OpenLLaMA) that predicts the next visual token, much like a language model predicts the next word. This formulation allows training on a massive dataset (420 billion tokens), scales effectively across model sizes, and enables zero-shot task adaptation through visual prompting.
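
To make the objective concrete, here is a minimal PyTorch sketch, not the project's actual code: a placeholder VQ tokenizer maps frames to discrete codes, and a tiny causal transformer is trained on next-token prediction over the resulting visual sentence. The vocabulary size, tokens-per-image count, and model dimensions are all illustrative assumptions.

```python
# Minimal sketch of visual-sentence next-token prediction (assumptions noted inline).
import torch
import torch.nn as nn

VOCAB_SIZE = 8192        # size of the visual codebook (assumed)
TOKENS_PER_IMAGE = 256   # discrete codes per image (assumed)
MAX_LEN = 4 * TOKENS_PER_IMAGE

def tokenize(images: torch.Tensor) -> torch.Tensor:
    """Placeholder for a VQ image tokenizer: (B, N, C, H, W) -> (B, N*256) codes."""
    b, n = images.shape[:2]
    return torch.randint(0, VOCAB_SIZE, (b, n * TOKENS_PER_IMAGE))

class TinyAutoregressiveLVM(nn.Module):
    def __init__(self, d_model=512, n_layers=4, n_heads=8):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, d_model)
        self.pos = nn.Parameter(torch.zeros(1, MAX_LEN, d_model))  # learned positions
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, VOCAB_SIZE)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        t = tokens.size(1)
        causal = nn.Transformer.generate_square_subsequent_mask(t)  # no peeking ahead
        h = self.blocks(self.embed(tokens) + self.pos[:, :t], mask=causal)
        return self.head(h)  # (B, T, VOCAB_SIZE)

# A visual sentence of 4 frames (e.g. a short video): predict token t+1 from tokens <= t.
model = TinyAutoregressiveLVM()
codes = tokenize(torch.randn(2, 4, 3, 256, 256))
logits = model(codes[:, :-1])
loss = nn.functional.cross_entropy(
    logits.reshape(-1, VOCAB_SIZE), codes[:, 1:].reshape(-1)
)
loss.backward()
```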

Quick Start & Requirements

  • Install via git clone https://github.com/ytongbai/LVM and conda env create -f scripts/gpu_environment.yml.
  • Requires a GPU environment (CUDA) and Python. Specific hardware requirements are not documented, but training at this scale implies significant computational resources.
  • Refer to DATASET.md for dataset preparation and EasyLM for distributed training details.
  • Conversion to a Hugging Face checkpoint: python -m EasyLM.models.llama.convert_easylm_to_hf (a loading sketch follows this list).
  • Demo and inference available on HuggingFace Spaces.
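
If the conversion step above yields a standard LLaMA-format checkpoint, which this sketch assumes rather than guarantees, it should load with the stock transformers API; the local path below is hypothetical.

```python
# Hedged sketch: load a converted checkpoint via Hugging Face transformers,
# assuming LLaMA-format output from convert_easylm_to_hf (path is hypothetical).
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("path/to/lvm_hf_checkpoint")
print(sum(p.numel() for p in model.parameters()))  # sanity-check parameter count
```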

Highlighted Details

  • Trained on 1.2 billion LAION images after deep filtering.
  • Supports model sizes from 100M to 30B parameters.
  • Achieves scalable learning across model and data diversity.
  • Enables zero-shot task adaptation via visual prompting, both sequential and analogy-style (a decoding sketch follows this list).
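
Analogy prompting works by concatenating an example input/output pair with a new query and letting the model continue the sequence. The sketch below is illustrative, assuming a trained autoregressive model over discrete visual codes; the model interface and token count are assumptions, not the project's API.

```python
# Illustrative analogy-style visual prompting via greedy continuation
# (the model interface and TOKENS_PER_IMAGE are assumptions, not LVM's API).
import torch

TOKENS_PER_IMAGE = 256

@torch.no_grad()
def analogy_prompt(model, x1, y1, x2):
    """Decode TOKENS_PER_IMAGE codes continuing the prompt [x1, y1, x2]."""
    seq = torch.cat([x1, y1, x2], dim=1)              # (B, 3*256) visual sentence
    for _ in range(TOKENS_PER_IMAGE):
        logits = model(seq)                           # (B, T, vocab)
        next_tok = logits[:, -1].argmax(-1, keepdim=True)
        seq = torch.cat([seq, next_tok], dim=1)       # append greedy choice
    return seq[:, -TOKENS_PER_IMAGE:]                 # predicted codes for y2
```

A sequential prompt works the same way, with the example pair replaced by a run of consecutive frames whose continuation the model predicts.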

Maintenance & Community

  • Developed in collaboration with HuggingFace.
  • No explicit community links (Discord/Slack) or roadmap are provided in the README.

Licensing & Compatibility

  • Licensed under the Apache 2.0 License.
  • Permissive license suitable for commercial use and integration with closed-source projects.

Limitations & Caveats

The README mentions releasing the 7B model, with additional sizes to follow later, so checkpoint availability is currently limited. The project is presented as a pretraining model; specific downstream tasks may require visual prompting or further fine-tuning.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 27 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems), Omar Sanseviero (DevRel at Google DeepMind), and 4 more.

open_flamingo by mlfoundations

Open-source framework for training large multimodal models
Top 0.1% · 4k stars · created 2 years ago · updated 11 months ago