LaVIT by jy0205

Multimodal LLM for visual content understanding and generation

Created 2 years ago
601 stars

Top 54.4% on SourcePulse

View on GitHub
Project Summary

LaVIT and Video-LaVIT are unified multimodal foundation models designed to empower Large Language Models (LLMs) with visual understanding and generation capabilities. Targeting researchers and developers working with multimodal AI, these models offer a single framework for processing and generating both text and visual content, simplifying complex pipelines.

How It Works

The core innovation lies in a unified pre-training strategy that treats visual content as a sequence of discrete tokens, akin to a foreign language for LLMs. A visual tokenizer converts images and videos into these discrete tokens, which the LLM can then process auto-regressively. A detokenizer reconstructs continuous visual signals from the LLM's generated tokens, enabling both understanding and generation tasks within a single, coherent framework.
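To make the pipeline concrete, below is a minimal, self-contained sketch of the tokenize -> generate -> detokenize loop. Every class name, dimension, and the simple vector-quantization scheme are illustrative assumptions for this summary, not LaVIT's actual implementation or API; consult the official repository and papers for the real models.

```python
# Toy sketch of the pipeline described above. All names and sizes are
# invented for illustration; this is NOT LaVIT's actual code or API.
import torch
import torch.nn as nn

class ToyVisualTokenizer(nn.Module):
    """Maps image patches to discrete codebook indices (vector quantization)."""
    def __init__(self, patch_dim=48, codebook_size=256, embed_dim=32):
        super().__init__()
        self.encoder = nn.Linear(patch_dim, embed_dim)
        self.codebook = nn.Embedding(codebook_size, embed_dim)

    def forward(self, patches):                       # patches: (N, patch_dim)
        z = self.encoder(patches)                     # (N, embed_dim)
        # Nearest codebook entry per patch -> discrete token ids
        dists = torch.cdist(z, self.codebook.weight)  # (N, codebook_size)
        return dists.argmin(dim=-1)                   # (N,) integer token ids

class ToyDetokenizer(nn.Module):
    """Reconstructs continuous patch features from discrete token ids
    (untrained here, so output shapes are meaningful but values are random)."""
    def __init__(self, patch_dim=48, codebook_size=256, embed_dim=32):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, embed_dim)
        self.decoder = nn.Linear(embed_dim, patch_dim)

    def forward(self, token_ids):                       # (N,)
        return self.decoder(self.codebook(token_ids))   # (N, patch_dim)

# Usage: an image becomes a token sequence the LLM can treat like text.
patches = torch.randn(16, 48)           # 16 flattened patches of a fake image
visual_tokens = ToyVisualTokenizer()(patches)  # discrete "foreign language" ids
# ... an LLM would autoregressively consume/emit such ids at this point ...
recon = ToyDetokenizer()(visual_tokens)        # continuous signal for rendering
print(visual_tokens.shape, recon.shape)        # torch.Size([16]) torch.Size([16, 48])
```

The key idea the sketch captures is that once visual content is reduced to integer ids, the LLM needs no architectural changes to consume or emit it.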

Quick Start & Requirements

  • Pre-trained weights and inference code are available on HuggingFace (a hedged download sketch follows this list).
  • Specific hardware requirements (e.g., GPU, CUDA versions) are not detailed in the README.
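Since the README points to HuggingFace for weights, here is a sketch of fetching a checkpoint with the huggingface_hub library. The library call is real, but the repo_id below is a placeholder assumption; substitute the model id actually linked from the README.

```python
# Hedged example: one common way to fetch weights hosted on HuggingFace.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="<org>/<lavit-model-id>",  # placeholder: use the real id from the README
    local_dir="./lavit_weights",       # where to store the checkpoint files
)
print(f"Weights downloaded to: {local_dir}")
```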

Highlighted Details

  • Unified framework for both image and video multimodal understanding and generation.
  • Auto-regressive prediction of next visual/textual tokens, leveraging LLM paradigms (illustrated in the sketch after this list).
  • Accepted to ICLR 2024 (LaVIT) and ICML 2024 Oral (Video-LaVIT).
  • Supports tasks like image/video captioning, visual question answering, and text-to-visual generation.
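The "next visual/textual token" bullet above can be illustrated with a short sketch: text and visual ids share one extended vocabulary, and a single causal cross-entropy objective covers both. The vocabulary sizes and offset scheme below are invented for clarity and are not LaVIT's actual configuration.

```python
# Illustrative unified next-token objective; sizes/offsets are assumptions.
import torch
import torch.nn.functional as F

TEXT_VOCAB, VISUAL_VOCAB = 32000, 16384
VOCAB = TEXT_VOCAB + VISUAL_VOCAB      # visual ids are offset past text ids

text_ids = torch.randint(0, TEXT_VOCAB, (1, 8))                    # "a photo of ..."
visual_ids = torch.randint(0, VISUAL_VOCAB, (1, 16)) + TEXT_VOCAB  # tokenized image
seq = torch.cat([text_ids, visual_ids], dim=1)     # one interleaved sequence

logits = torch.randn(1, seq.size(1), VOCAB)        # stand-in for LLM output
# Standard causal LM loss: predict token t+1 from tokens <= t,
# regardless of whether the target token is textual or visual.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, VOCAB),
    seq[:, 1:].reshape(-1),
)
print(loss.item())
```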

Maintenance & Community

  • Official repository for LaVIT and Video-LaVIT.
  • Development was active through the 2024 paper acceptances, though commit activity has since slowed (see Health Check below).
  • No explicit links to community channels (e.g., Discord, Slack) or roadmaps are provided.

Licensing & Compatibility

  • The README does not specify a license.
  • Compatibility for commercial or closed-source use is not detailed.

Limitations & Caveats

The README lacks specific details on installation, hardware requirements, and licensing, which may hinder rapid adoption. The exact mechanics of LaVIT's "dynamic discrete visual tokenization" and Video-LaVIT's "decoupled visual-motional tokenization" are detailed only in the cited papers.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 2 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems") and Omar Sanseviero (DevRel at Google DeepMind).

gill by kohjingyu

Multimodal LLM for generating/retrieving images and generating text
470 stars · Created 2 years ago · Updated 2 years ago