Vision pretraining model for scalable learning using visual sentences
LVM is a vision pretraining model that frames visual data as "visual sentences" for autoregressive next-token prediction, enabling scalable learning without linguistic data. It targets researchers and practitioners in computer vision and large model development, offering a novel approach to multimodal learning.
How It Works
LVM takes a sequential modeling approach: diverse visual data (images, videos, semantic segmentations, depth reconstructions) is converted into a unified "visual sentence" format, with each image first mapped to discrete visual tokens by a VQGAN tokenizer. The resulting token sequence is processed by an autoregressive model (built on OpenLLaMA) that predicts the next visual token, just as a language model predicts the next word. This formulation supports training on massive datasets (420 billion tokens), scales effectively across model sizes, and enables zero-shot task adaptation by designing suitable visual prompts at inference time.
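The sketch below illustrates the core idea in PyTorch rather than the repository's JAX/EasyLM training code: images are mapped to discrete tokens, the token grids are concatenated into one "visual sentence", and a causal transformer is trained with the same next-token cross-entropy objective as a language model. The `vq_tokenize` stub, codebook size, grid size, and the tiny model are illustrative assumptions, not LVM's actual components.

```python
# Illustrative sketch of the "visual sentence" objective (not the LVM/EasyLM code).
# Assumptions: `vq_tokenize` stands in for a pretrained VQ image tokenizer, and
# TinyCausalLM stands in for the LLaMA-style autoregressive backbone.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB_SIZE = 8192       # size of the visual codebook (assumed)
TOKENS_PER_IMAGE = 256  # e.g. a 16x16 grid of discrete codes per image (assumed)

def vq_tokenize(image: torch.Tensor) -> torch.Tensor:
    """Stand-in for a VQ encoder: map one image to a grid of discrete token ids."""
    return torch.randint(0, VOCAB_SIZE, (TOKENS_PER_IMAGE,))

def build_visual_sentence(images: list) -> torch.Tensor:
    """Concatenate per-image token grids so frames, segmentation maps, depth maps,
    etc. become one flat token sequence (a "visual sentence")."""
    return torch.cat([vq_tokenize(img) for img in images])

class TinyCausalLM(nn.Module):
    def __init__(self, d_model=256, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, VOCAB_SIZE)

    def forward(self, tokens):
        seq_len = tokens.size(1)
        # Causal mask: position t may only attend to positions <= t.
        causal = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        h = self.blocks(self.embed(tokens), mask=causal)
        return self.head(h)

# A "visual sentence" built from four frames; the training signal is simply
# next-token prediction over the concatenated visual tokens.
images = [torch.rand(3, 256, 256) for _ in range(4)]
sentence = build_visual_sentence(images).unsqueeze(0)      # shape (1, 4 * 256)
model = TinyCausalLM()
logits = model(sentence[:, :-1])                           # predict token t+1 from tokens <= t
loss = F.cross_entropy(logits.reshape(-1, VOCAB_SIZE), sentence[:, 1:].reshape(-1))
loss.backward()
```

At inference time the same interface supports visual prompting: a few example input/output images are tokenized as the prefix, and the model continues the sequence with the tokens of the desired output.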
Quick Start & Requirements
Clone the repository:
git clone https://github.com/ytongbai/LVM
Then create the conda environment:
conda env create -f scripts/gpu_environment.yml
See DATASET.md for dataset preparation and EasyLM for distributed training details. Trained checkpoints can be converted to Hugging Face format with:
python -m EasyLM.models.llama.convert_easylm_to_hf
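If the conversion step produces a standard Hugging Face LLaMA checkpoint (which the module name suggests, but is an assumption here), it can be loaded with the transformers library; the output directory ./lvm-7b-hf and the random prompt tokens below are hypothetical placeholders.

```python
# Sketch of loading a converted checkpoint with Hugging Face Transformers.
# Assumptions: ./lvm-7b-hf is a hypothetical output path from the conversion step,
# and the prompt is random ids standing in for the VQ tokens of real prompt images.
import torch
from transformers import LlamaForCausalLM

model = LlamaForCausalLM.from_pretrained("./lvm-7b-hf")
model.eval()

prompt_tokens = torch.randint(0, model.config.vocab_size, (1, 256))
with torch.no_grad():
    out = model.generate(prompt_tokens, max_new_tokens=256, do_sample=False)
# `out` holds the prompt followed by predicted visual tokens; a VQ decoder is
# needed to turn those tokens back into pixels.
print(out.shape)
```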
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The README notes that the 7B model has been released and that additional sizes will be made available later, so checkpoint availability is currently limited. The project is presented as a pretraining model and requires further fine-tuning (or suitable visual prompts) for specific downstream tasks.