vstar by penghao-wu

PyTorch implementation for a multimodal LLM research paper

Created 1 year ago
673 stars

Top 50.1% on SourcePulse

View on GitHub
1 Expert Loves This Project
Project Summary

V* provides a PyTorch implementation for guided visual search within multimodal Large Language Models (LLMs), addressing the challenge of grounding LLM reasoning in specific visual elements. It's designed for researchers and developers working on advanced vision-language understanding and generation tasks.

How It Works

V* integrates a visual search mechanism as a core component of multimodal LLMs. This approach allows the model to actively identify and focus on relevant objects or regions within an image based on textual queries or context, enhancing the accuracy and specificity of visual question answering and other vision-language tasks.
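
A minimal sketch of that loop, assuming a hypothetical vqa_llm (the multimodal LLM) and visual_search (the search model); the names and interfaces below are illustrative placeholders, not the repository's actual API:

```python
# Hypothetical sketch of guided visual search inside a VQA loop.
# `vqa_llm` and `visual_search` are illustrative stand-ins, not the
# actual objects exposed by this repository.
from dataclasses import dataclass

@dataclass
class Localization:
    target: str
    box: tuple  # (x0, y0, x1, y1) in image coordinates

def answer_with_search(image, question, vqa_llm, visual_search):
    """Answer `question` about `image`, searching for details the LLM cannot resolve."""
    # 1. Ask the multimodal LLM directly; it may name objects it needs
    #    but cannot locate at the current resolution (small or occluded targets).
    answer, missing_objects = vqa_llm.answer(image, question)
    if not missing_objects:
        return answer

    # 2. Run the visual search model to localize each missing object,
    #    e.g. by zooming into promising sub-regions of the image.
    localizations = []
    for target in missing_objects:
        box = visual_search.locate(image, target)
        if box is not None:
            localizations.append(Localization(target, box))

    # 3. Re-query the LLM with the localized crops as additional visual context.
    crops = [image.crop(loc.box) for loc in localizations]
    answer, _ = vqa_llm.answer(image, question, extra_views=crops)
    return answer
```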

Quick Start & Requirements

  • Installation: Create and activate a Conda environment (conda create -n vstar python=3.10 -y, conda activate vstar), install the dependencies (pip install -r requirements.txt, pip install flash-attn --no-build-isolation), and set PYTHONPATH; the commands are collected into a snippet after this list.
  • Prerequisites: Python 3.10, Conda, PyTorch. FlashAttention is recommended for performance.
  • Pre-trained Models: Download links for VQA LLM and visual search models are provided.
  • Datasets: Requires LAION-CC-SBU subset, instruction tuning datasets, and images from COCO-2014, COCO-2017, and GQA.
  • Demo: Run python app.py for a local Gradio demo.
  • Links: Paper, Project Page, Online Demo
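
For convenience, the setup and demo commands above collected into one snippet (the PYTHONPATH value is a placeholder; check the repository README for the exact export):

```bash
# Create and activate the environment
conda create -n vstar python=3.10 -y
conda activate vstar

# Install dependencies (FlashAttention is recommended for performance)
pip install -r requirements.txt
pip install flash-attn --no-build-isolation

# Make the code importable (placeholder path; see the repository README)
export PYTHONPATH=$PYTHONPATH:/path/to/vstar

# Launch the local Gradio demo
python app.py
```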

Highlighted Details

  • Implements "V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs".
  • Includes a benchmark suite (V*Bench) for evaluation.
  • Supports training for both VQA LLM and visual search components.
  • Built upon the LLaVA and LISA projects.

Maintenance & Community

  • Primary contributors: Penghao Wu, Saining Xie.
  • The project is based on LLaVA and LISA.

Licensing & Compatibility

  • MIT License. Permissive for commercial use and closed-source linking.

Limitations & Caveats

The project requires significant dataset preparation and management, including downloading and organizing large image datasets (COCO, GQA) and specific subsets for training. Training involves multiple stages and potentially substantial computational resources.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1
Star History
8 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Omar Sanseviero (DevRel at Google DeepMind).

gill by kohjingyu

0% · 463 stars
Multimodal LLM for generating/retrieving images and generating text
Created 2 years ago · Updated 1 year ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Elvis Saravia (Founder of DAIR.AI).

DeepSeek-VL2 by deepseek-ai

0.1% · 5k stars
MoE vision-language model for multimodal understanding
Created 9 months ago · Updated 6 months ago