LaVIT by jy0205

Multimodal LLM for visual content understanding and generation

Created 2 years ago
601 stars

Top 54.4% on SourcePulse

View on GitHub
Project Summary

LaVIT and Video-LaVIT are unified multimodal foundation models designed to empower Large Language Models (LLMs) with visual understanding and generation capabilities. Targeting researchers and developers working with multimodal AI, these models offer a single framework for processing and generating both text and visual content, simplifying complex pipelines.

How It Works

The core innovation lies in a unified pre-training strategy that treats visual content as a sequence of discrete tokens, akin to a foreign language for LLMs. A visual tokenizer converts images and videos into these discrete tokens, which the LLM can then process auto-regressively. A detokenizer reconstructs continuous visual signals from the LLM's generated tokens, enabling both understanding and generation tasks within a single, coherent framework.
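To make the pipeline concrete, below is a minimal, self-contained sketch of the tokenize -> generate -> detokenize loop. Every class name, dimension, and the simple vector-quantization scheme are illustrative assumptions for this summary, not LaVIT's actual implementation or API; consult the official repository and papers for the real models.

```python
# Toy sketch of the pipeline described above. All names and sizes are
# invented for illustration; this is NOT LaVIT's actual code or API.
import torch
import torch.nn as nn

class ToyVisualTokenizer(nn.Module):
    """Maps image patches to discrete codebook indices (vector quantization)."""
    def __init__(self, patch_dim=48, codebook_size=256, embed_dim=32):
        super().__init__()
        self.encoder = nn.Linear(patch_dim, embed_dim)
        self.codebook = nn.Embedding(codebook_size, embed_dim)

    def forward(self, patches):                       # patches: (N, patch_dim)
        z = self.encoder(patches)                     # (N, embed_dim)
        # Nearest codebook entry per patch -> discrete token ids
        dists = torch.cdist(z, self.codebook.weight)  # (N, codebook_size)
        return dists.argmin(dim=-1)                   # (N,) integer token ids

class ToyDetokenizer(nn.Module):
    """Reconstructs continuous patch features from discrete token ids
    (untrained here, so output shapes are meaningful but values are random)."""
    def __init__(self, patch_dim=48, codebook_size=256, embed_dim=32):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, embed_dim)
        self.decoder = nn.Linear(embed_dim, patch_dim)

    def forward(self, token_ids):                       # (N,)
        return self.decoder(self.codebook(token_ids))   # (N, patch_dim)

# Usage: an image becomes a token sequence the LLM can treat like text.
patches = torch.randn(16, 48)           # 16 flattened patches of a fake image
visual_tokens = ToyVisualTokenizer()(patches)  # discrete "foreign language" ids
# ... an LLM would autoregressively consume/emit such ids at this point ...
recon = ToyDetokenizer()(visual_tokens)        # continuous signal for rendering
print(visual_tokens.shape, recon.shape)        # torch.Size([16]) torch.Size([16, 48])
```

The key idea the sketch captures is that once visual content is reduced to integer ids, the LLM needs no architectural changes to consume or emit it.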

Quick Start & Requirements

  • Pre-trained weights and inference code are available on HuggingFace (a hedged download sketch follows this list).
  • Specific hardware requirements (e.g., GPU, CUDA versions) are not detailed in the README.
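Since the README points to HuggingFace for weights, here is a sketch of fetching a checkpoint with the huggingface_hub library. The library call is real, but the repo_id below is a placeholder assumption; substitute the model id actually linked from the README.

```python
# Hedged example: one common way to fetch weights hosted on HuggingFace.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="<org>/<lavit-model-id>",  # placeholder: use the real id from the README
    local_dir="./lavit_weights",       # where to store the checkpoint files
)
print(f"Weights downloaded to: {local_dir}")
```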

Highlighted Details

  • Unified framework for both image and video multimodal understanding and generation.
  • Auto-regressive prediction of next visual/textual tokens, leveraging LLM paradigms (illustrated in the sketch after this list).
  • Accepted to ICLR 2024 (LaVIT) and ICML 2024 Oral (Video-LaVIT).
  • Supports tasks like image/video captioning, visual question answering, and text-to-visual generation.
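The "next visual/textual token" bullet above can be illustrated with a short sketch: text and visual ids share one extended vocabulary, and a single causal cross-entropy objective covers both. The vocabulary sizes and offset scheme below are invented for clarity and are not LaVIT's actual configuration.

```python
# Illustrative unified next-token objective; sizes/offsets are assumptions.
import torch
import torch.nn.functional as F

TEXT_VOCAB, VISUAL_VOCAB = 32000, 16384
VOCAB = TEXT_VOCAB + VISUAL_VOCAB      # visual ids are offset past text ids

text_ids = torch.randint(0, TEXT_VOCAB, (1, 8))                    # "a photo of ..."
visual_ids = torch.randint(0, VISUAL_VOCAB, (1, 16)) + TEXT_VOCAB  # tokenized image
seq = torch.cat([text_ids, visual_ids], dim=1)     # one interleaved sequence

logits = torch.randn(1, seq.size(1), VOCAB)        # stand-in for LLM output
# Standard causal LM loss: predict token t+1 from tokens <= t,
# regardless of whether the target token is textual or visual.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, VOCAB),
    seq[:, 1:].reshape(-1),
)
print(loss.item())
```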

Maintenance & Community

  • Official repository for LaVIT and Video-LaVIT.
  • Development was active through the 2024 paper acceptances, though commit activity has since slowed (see Health Check below).
  • No explicit links to community channels (e.g., Discord, Slack) or roadmaps are provided.

Licensing & Compatibility

  • The README does not specify a license.
  • Compatibility for commercial or closed-source use is not detailed.

Limitations & Caveats

The README lacks specific details on installation, hardware requirements, and licensing, which may hinder rapid adoption. The exact mechanics of LaVIT's "dynamic discrete visual tokenization" and Video-LaVIT's "decoupled visual-motional tokenization" are detailed only in the cited papers.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 2 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems") and Omar Sanseviero (DevRel at Google DeepMind).

gill by kohjingyu

Multimodal LLM for generating/retrieving images and generating text
470 stars · Created 2 years ago · Updated 2 years ago