LaVIT by jy0205

Multimodal LLM for visual content understanding and generation

created 1 year ago
585 stars

Top 56.2% on sourcepulse

Project Summary

LaVIT and Video-LaVIT are unified multimodal foundation models designed to empower Large Language Models (LLMs) with visual understanding and generation capabilities. Targeting researchers and developers working with multimodal AI, these models offer a single framework for processing and generating both text and visual content, simplifying complex pipelines.

How It Works

The core innovation lies in a unified pre-training strategy that treats visual content as a sequence of discrete tokens, akin to a foreign language for LLMs. A visual tokenizer converts images and videos into these discrete tokens, which the LLM can then process auto-regressively. A detokenizer reconstructs continuous visual signals from the LLM's generated tokens, enabling both understanding and generation tasks within a single, coherent framework.
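The tokenize → LLM → detokenize loop can be illustrated with a minimal vector-quantization sketch. Everything below (the codebook size, the `tokenize`/`detokenize` functions, the shapes) is a toy assumption for illustration, not LaVIT's actual tokenizer or API:

```python
# Toy sketch of treating visual content as discrete tokens, VQ-style.
# Names, shapes, and the random "codebook" are illustrative only.
import numpy as np

rng = np.random.default_rng(0)

# A hypothetical codebook of learned visual embeddings:
# 256 discrete token ids, each mapping to a 16-dim vector.
CODEBOOK = rng.standard_normal((256, 16))


def tokenize(patches: np.ndarray) -> np.ndarray:
    """Map continuous patch features to discrete token ids (nearest codebook entry)."""
    # patches: (num_patches, 16) -> (num_patches,) integer ids
    dists = np.linalg.norm(patches[:, None, :] - CODEBOOK[None, :, :], axis=-1)
    return dists.argmin(axis=1)


def detokenize(token_ids: np.ndarray) -> np.ndarray:
    """Recover continuous features from discrete tokens via codebook lookup."""
    return CODEBOOK[token_ids]


patches = rng.standard_normal((9, 16))  # e.g. a 3x3 grid of patch embeddings
ids = tokenize(patches)                 # discrete "visual words" an LLM can consume
recon = detokenize(ids)                 # continuous signal recovered for generation
assert ids.shape == (9,) and recon.shape == patches.shape
```

In the real models, the LLM predicts the next token id in this shared visual/textual vocabulary auto-regressively, and a learned decoder (not a simple lookup) reconstructs pixels from the generated ids.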

Quick Start & Requirements

  • Pre-trained weights and inference code are available on HuggingFace.
  • Specific hardware requirements (e.g., GPU, CUDA versions) are not detailed in the README.

Highlighted Details

  • Unified framework for both image and video multimodal understanding and generation.
  • Auto-regressive prediction of next visual/textual tokens, leveraging LLM paradigms.
  • Accepted to ICLR 2024 (LaVIT) and ICML 2024 Oral (Video-LaVIT).
  • Supports tasks like image/video captioning, visual question answering, and text-to-visual generation.

Maintenance & Community

  • Official repository for LaVIT and Video-LaVIT.
  • Active development with recent updates and paper acceptances.
  • No explicit links to community channels (e.g., Discord, Slack) or roadmaps are provided.

Licensing & Compatibility

  • The README does not specify a license.
  • Compatibility for commercial or closed-source use is not detailed.

Limitations & Caveats

The README lacks specific details on installation, hardware requirements, and licensing, which may hinder rapid adoption. The exact nature of the "dynamic discrete visual tokenization" and "decoupled visual-motional tokenization" requires further investigation into the cited papers.

Health Check

  • Last commit: 10 months ago
  • Responsiveness: 1+ week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star history: 13 stars in the last 90 days
