VARGPT: Multimodal LLM for unified visual understanding and generation tasks
VARGPT is a multimodal large language model for unified visual understanding and generation, aimed at researchers and developers working with vision-language models. It supports image captioning, visual question answering, and text-to-image generation within a single autoregressive framework.
How It Works
VARGPT treats understanding and generation as distinct prediction paradigms within a unified autoregressive architecture. For understanding, it predicts the next token, like a standard LLM. For generation, it predicts the next scale: an image is produced coarse-to-fine, with each step emitting a whole visual token map at a higher resolution, conditioned on all coarser ones. Both objectives are trained through a three-stage instruction-tuning process.
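To make the dual objective concrete, here is a minimal PyTorch sketch. It is not the actual VARGPT implementation; module names, dimensions, and vocabulary sizes are illustrative. A shared causal backbone feeds two heads: a text head decoded one token at a time, and a visual head decoded one scale (a whole token map) at a time.

```python
# Minimal sketch of dual next-token / next-scale prediction.
# All names and sizes are illustrative, not the real VARGPT code.
import torch
import torch.nn as nn

class SharedBackbone(nn.Module):
    """Causal transformer over an interleaved text/image token sequence."""
    def __init__(self, d_model=256, n_heads=4, n_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, x):
        # Causal mask: every position attends only to its prefix.
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        return self.encoder(x, mask=mask)

class DualHeadSketch(nn.Module):
    def __init__(self, text_vocab=32000, visual_vocab=4096, d_model=256):
        super().__init__()
        self.backbone = SharedBackbone(d_model)
        self.text_head = nn.Linear(d_model, text_vocab)      # next-token prediction
        self.visual_head = nn.Linear(d_model, visual_vocab)  # next-scale prediction

    def forward(self, embedded_prefix):
        h = self.backbone(embedded_prefix)
        return self.text_head(h), self.visual_head(h)

model = DualHeadSketch()
prefix = torch.randn(1, 16, 256)  # embedded prompt + coarser image scales
text_logits, visual_logits = model(prefix)

# Understanding: decode one token at a time, exactly like a standard LLM.
next_text_token = text_logits[:, -1].argmax(dim=-1)

# Generation: the last k positions predict an entire next scale in parallel,
# e.g. a 2x2 visual token map conditioned on all coarser scales in the prefix.
next_scale_tokens = visual_logits[:, -4:].argmax(dim=-1)
print(next_text_token.shape, next_scale_tokens.shape)  # (1,) and (1, 4)
```

In VARGPT itself the visual tokens come from a multi-scale image tokenizer and are decoded back to pixels; the sketch elides tokenization and embedding entirely.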
Quick Start & Requirements
Install the Python dependencies listed in requirements.txt:

pip3 install -r requirements.txt
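The repository ships its own inference scripts, which are the authoritative entry points. As a hedged illustration only, the snippet below assumes a LLaVA-style transformers interface; the checkpoint id, model class, and prompt format are assumptions, not confirmed by this summary.

```python
# Hypothetical usage sketch: assumes a LLaVA-style transformers interface.
# The checkpoint id is an assumed placeholder; consult the repository's
# inference scripts for the real entry points and model names.
import requests
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "VARGPT-family/VARGPT_LLaVA-v1"  # assumed checkpoint id
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)

image = Image.open(requests.get("https://example.com/cat.png", stream=True).raw)
prompt = "USER: <image>\nDescribe this image. ASSISTANT:"  # assumed prompt format
inputs = processor(images=image, text=prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(output[0], skip_special_tokens=True))
```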
Highlighted Details
Visual understanding benchmarks can be reproduced with lmms-eval using the provided evaluation scripts; a command sketch follows.
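If the provided scripts wrap lmms-eval's standard CLI, an invocation could look like the following; the model adapter, checkpoint id, and task here are placeholders, so defer to the repository's evaluation scripts for the exact arguments.

```
python3 -m lmms_eval \
    --model llava \
    --model_args pretrained="VARGPT-family/VARGPT_LLaVA-v1" \
    --tasks mme \
    --batch_size 1 \
    --output_path ./logs/
```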
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The last commit was 3 months ago and the project is marked inactive, so users should expect limited maintenance and support.