vstar by penghao-wu

PyTorch implementation for a multimodal LLM research paper

created 1 year ago
653 stars

Top 52.1% on sourcepulse

Project Summary

V* provides a PyTorch implementation for guided visual search within multimodal Large Language Models (LLMs), addressing the challenge of grounding LLM reasoning in specific visual elements. It's designed for researchers and developers working on advanced vision-language understanding and generation tasks.

How It Works

V* integrates a visual search mechanism as a core component of multimodal LLMs. This approach allows the model to actively identify and focus on relevant objects or regions within an image based on textual queries or context, enhancing the accuracy and specificity of visual question answering and other vision-language tasks.
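
This summary does not spell out the pipeline, so the following is a minimal conceptual sketch of such a search loop in Python. Everything in it (the vqa_llm and search_model callables, the SearchState bookkeeping) is a hypothetical stand-in for illustration, not the repository's actual API.

    # Conceptual sketch only: hypothetical interfaces illustrating a guided
    # visual search loop, not this repository's actual API.
    from dataclasses import dataclass, field

    @dataclass
    class SearchState:
        """Bookkeeping for regions the search has already inspected."""
        visited: list = field(default_factory=list)

    def crop(image, box):
        """Crop a PIL image to an (x0, y0, x1, y1) box."""
        return image.crop(box)

    def answer_with_visual_search(image, question, vqa_llm, search_model,
                                  max_steps=5):
        """Alternate between answering and searching for missing evidence.

        vqa_llm(image, question, context) -> (answer, missing_target or None)
        search_model(image, target, visited) -> (x0, y0, x1, y1)
        Both callables are hypothetical stand-ins for the trained components.
        """
        state = SearchState()
        context = []  # (target name, cropped region) pairs grounded so far
        for _ in range(max_steps):
            answer, missing_target = vqa_llm(image, question, context)
            if missing_target is None:
                return answer  # enough visual evidence was found
            # Ask the search model where the missing object likely is.
            box = search_model(image, missing_target, state.visited)
            state.visited.append(box)
            context.append((missing_target, crop(image, box)))
        # Out of search budget: answer with whatever has been grounded.
        answer, _ = vqa_llm(image, question, context)
        return answer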

Quick Start & Requirements

  • Installation: Create and activate a Conda environment (conda create -n vstar python=3.10 -y, conda activate vstar), install the dependencies (pip install -r requirements.txt, pip install flash-attn --no-build-isolation), and add the repository directory to PYTHONPATH (the commands are collected in the snippet after this list).
  • Prerequisites: Python 3.10, Conda, PyTorch. FlashAttention is recommended for performance.
  • Pre-trained Models: Download links for VQA LLM and visual search models are provided.
  • Datasets: Requires LAION-CC-SBU subset, instruction tuning datasets, and images from COCO-2014, COCO-2017, and GQA.
  • Demo: Run python app.py for a local Gradio demo.
  • Links: Paper, Project Page, Online Demo
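
For convenience, the installation commands from the bullet above, collected in one place. The PYTHONPATH line is an assumption about the checkout location; adjust it to wherever you cloned the repository.

    conda create -n vstar python=3.10 -y
    conda activate vstar
    pip install -r requirements.txt
    pip install flash-attn --no-build-isolation
    # Assumed path: point PYTHONPATH at your local clone of the repository.
    export PYTHONPATH=$PYTHONPATH:/path/to/vstar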

Highlighted Details

  • Implements "V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs".
  • Includes a benchmark suite (V*Bench) for evaluation; a sketch of such an evaluation loop follows this list.
  • Supports training for both VQA LLM and visual search components.
  • Built upon LLaVA and LISA projects.
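
V*Bench's file format and metrics are not described in this summary. As a rough sketch, an exact-match accuracy loop over benchmark items might look like the following, with the item schema and the model_answer callable as hypothetical stand-ins.

    # Rough sketch of a benchmark evaluation loop; the item schema and the
    # model_answer callable are hypothetical, not V*Bench's actual format.
    def evaluate(items, model_answer):
        """Exact-match accuracy over dicts with image/question/answer keys."""
        correct = 0
        for item in items:
            prediction = model_answer(item["image"], item["question"])
            correct += prediction.strip().lower() == item["answer"].strip().lower()
        return correct / max(len(items), 1)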

Maintenance & Community

  • Primary contributors: Penghao Wu, Saining Xie.

Licensing & Compatibility

  • MIT License. Permissive for commercial use and closed-source linking.

Limitations & Caveats

The project requires significant dataset preparation and management, including downloading and organizing large image datasets (COCO, GQA) and specific subsets for training. Training involves multiple stages and potentially substantial computational resources.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 52 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering, Designing Machine Learning Systems), Omar Sanseviero (DevRel at Google DeepMind), and 4 more.

open_flamingo by mlfoundations

Open-source framework for training large multimodal models

  • Top 0.1%
  • 4k stars
  • created 2 years ago, updated 11 months ago