LaVIT by jy0205

Multimodal LLM for visual content understanding and generation

Created 2 years ago
590 stars

Top 55.1% on SourcePulse

View on GitHub
Project Summary

LaVIT and Video-LaVIT are unified multimodal foundation models designed to empower Large Language Models (LLMs) with visual understanding and generation capabilities. Targeting researchers and developers working with multimodal AI, these models offer a single framework for processing and generating both text and visual content, simplifying complex pipelines.

How It Works

The core innovation lies in a unified pre-training strategy that treats visual content as a sequence of discrete tokens, akin to a foreign language for LLMs. A visual tokenizer converts images and videos into these discrete tokens, which the LLM can then process auto-regressively. A detokenizer reconstructs continuous visual signals from the LLM's generated tokens, enabling both understanding and generation tasks within a single, coherent framework.
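As a rough illustration of this pipeline, the sketch below quantizes image patch features against a learned codebook, splices the resulting ids into a text token sequence, and maps generated ids back to continuous features. All names (VisualTokenizer, tokenize, detokenize) and shapes are illustrative assumptions, not the actual LaVIT API.

```python
# Conceptual sketch of the unified token pipeline; NOT the actual LaVIT API.
# VisualTokenizer, tokenize, detokenize, and all shapes are illustrative assumptions.
import torch
import torch.nn as nn

class VisualTokenizer(nn.Module):
    """Toy vector quantizer: maps patch features to discrete codebook ids."""
    def __init__(self, num_codes=16384, dim=256):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def tokenize(self, patch_feats):
        # Nearest codebook entry per patch: (N, dim) -> (N,) integer ids.
        dists = torch.cdist(patch_feats, self.codebook.weight)
        return dists.argmin(dim=-1)

    def detokenize(self, ids):
        # Inverse lookup: (N,) ids -> (N, dim) continuous features.
        # A real detokenizer would further decode these features to pixels.
        return self.codebook(ids)

tokenizer = VisualTokenizer()
patches = torch.randn(196, 256)               # e.g. a 14x14 grid of ViT patch features
visual_ids = tokenizer.tokenize(patches)      # the image as a "foreign language"
text_ids = torch.tensor([101, 2023, 2003])    # placeholder text token ids
sequence = torch.cat([visual_ids, text_ids])  # one sequence the LLM models auto-regressively
features = tokenizer.detokenize(visual_ids)   # generation path: ids back to continuous signal
```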

Quick Start & Requirements

  • Pre-trained weights and inference code are available on HuggingFace (a download sketch follows this list).
  • Specific hardware requirements (e.g., GPU, CUDA versions) are not detailed in the README.
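Since the README summary gives no copy-paste install command, here is a minimal sketch of fetching the weights with the huggingface_hub client. The repo id below is an assumption; verify it against the project README.

```python
# Hedged sketch: downloading pre-trained weights with huggingface_hub.
# The repo_id is an assumption -- check the LaVIT README for the real one.
from huggingface_hub import snapshot_download

weights_dir = snapshot_download(
    repo_id="rain1011/LaVIT-7B-v2",  # assumed repo id; verify on HuggingFace
    local_dir="./LaVIT-weights",
)
print(f"Weights downloaded to {weights_dir}")
```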

Highlighted Details

  • Unified framework for both image and video multimodal understanding and generation.
  • Auto-regressive prediction of next visual/textual tokens, leveraging LLM paradigms (a toy objective is sketched after this list).
  • Accepted to ICLR 2024 (LaVIT) and ICML 2024 Oral (Video-LaVIT).
  • Supports tasks like image/video captioning, visual question answering, and text-to-visual generation.
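To make "next visual/textual token" concrete, here is a toy version of the training objective. It assumes (our assumption, not stated in the README) that visual codebook ids are appended to the text vocabulary, so a single cross-entropy loss covers both modalities:

```python
# Toy unified next-token objective over a mixed visual+text sequence.
# The vocabulary layout (text vocab + visual codebook) is an assumption for illustration.
import torch
import torch.nn.functional as F

vocab_size = 32000 + 16384                 # assumed: text tokens + visual codes
logits = torch.randn(1, 10, vocab_size)    # LLM outputs for a 10-token mixed sequence
targets = torch.randint(0, vocab_size, (1, 10))

# Shift so position t predicts token t+1; the same loss applies to both modalities.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),
    targets[:, 1:].reshape(-1),
)
```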

Maintenance & Community

  • Official repository for LaVIT and Video-LaVIT.
  • Active development with recent updates and paper acceptances.
  • No explicit links to community channels (e.g., Discord, Slack) or roadmaps are provided.

Licensing & Compatibility

  • The README does not specify a license.
  • Compatibility for commercial or closed-source use is not detailed.

Limitations & Caveats

The README lacks specific details on installation, hardware requirements, and licensing, which may hinder rapid adoption. Understanding the exact mechanics of the "dynamic discrete visual tokenization" (LaVIT) and "decoupled visual-motional tokenization" (Video-LaVIT) requires consulting the cited papers.

Health Check

  • Last commit: 11 months ago
  • Responsiveness: inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 2 stars in the last 30 days

Starred by Jason Knight (Director AI Compilers at NVIDIA; Cofounder of OctoML), Travis Fischer (Founder of Agentic), and 5 more.

Explore Similar Projects

fromage by kohjingyu

482 stars
Multimodal model for grounding language models to images
Created 2 years ago
Updated 1 year ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Omar Sanseviero (DevRel at Google DeepMind).

gill by kohjingyu

463 stars
Multimodal LLM for generating/retrieving images and generating text
Created 2 years ago
Updated 1 year ago