vstar by penghao-wu

PyTorch implementation for a multimodal LLM research paper

Created 1 year ago
673 stars

Top 50.1% on SourcePulse

View on GitHub
1 Expert Loves This Project
Project Summary

V* provides a PyTorch implementation for guided visual search within multimodal Large Language Models (LLMs), addressing the challenge of grounding LLM reasoning in specific visual elements. It's designed for researchers and developers working on advanced vision-language understanding and generation tasks.

How It Works

V* integrates a visual search mechanism as a core component of multimodal LLMs. This approach allows the model to actively identify and focus on relevant objects or regions within an image based on textual queries or context, enhancing the accuracy and specificity of visual question answering and other vision-language tasks.
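
A minimal sketch of that loop, assuming a hypothetical vqa_llm (the multimodal LLM) and visual_search (the search model); the names and interfaces below are illustrative placeholders, not the repository's actual API:

```python
# Hypothetical sketch of guided visual search inside a VQA loop.
# `vqa_llm` and `visual_search` are illustrative stand-ins, not the
# actual objects exposed by this repository.
from dataclasses import dataclass

@dataclass
class Localization:
    target: str
    box: tuple  # (x0, y0, x1, y1) in image coordinates

def answer_with_search(image, question, vqa_llm, visual_search):
    """Answer `question` about `image`, searching for details the LLM cannot resolve."""
    # 1. Ask the multimodal LLM directly; it may name objects it needs
    #    but cannot locate at the current resolution (small or occluded targets).
    answer, missing_objects = vqa_llm.answer(image, question)
    if not missing_objects:
        return answer

    # 2. Run the visual search model to localize each missing object,
    #    e.g. by zooming into promising sub-regions of the image.
    localizations = []
    for target in missing_objects:
        box = visual_search.locate(image, target)
        if box is not None:
            localizations.append(Localization(target, box))

    # 3. Re-query the LLM with the localized crops as additional visual context.
    crops = [image.crop(loc.box) for loc in localizations]
    answer, _ = vqa_llm.answer(image, question, extra_views=crops)
    return answer
```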

Quick Start & Requirements

  • Installation: Create and activate a Conda environment (conda create -n vstar python=3.10 -y, conda activate vstar), install the dependencies (pip install -r requirements.txt, pip install flash-attn --no-build-isolation), and set PYTHONPATH; the commands are collected into a snippet after this list.
  • Prerequisites: Python 3.10, Conda, PyTorch. FlashAttention is recommended for performance.
  • Pre-trained Models: Download links for VQA LLM and visual search models are provided.
  • Datasets: Requires LAION-CC-SBU subset, instruction tuning datasets, and images from COCO-2014, COCO-2017, and GQA.
  • Demo: Run python app.py for a local Gradio demo.
  • Links: Paper, Project Page, Online Demo
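
For convenience, the setup and demo commands above collected into one snippet (the PYTHONPATH value is a placeholder; check the repository README for the exact export):

```bash
# Create and activate the environment
conda create -n vstar python=3.10 -y
conda activate vstar

# Install dependencies (FlashAttention is recommended for performance)
pip install -r requirements.txt
pip install flash-attn --no-build-isolation

# Make the code importable (placeholder path; see the repository README)
export PYTHONPATH=$PYTHONPATH:/path/to/vstar

# Launch the local Gradio demo
python app.py
```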

Highlighted Details

  • Implements "V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs".
  • Includes a benchmark suite (V*Bench) for evaluation.
  • Supports training for both VQA LLM and visual search components.
  • Built upon the LLaVA and LISA projects.

Maintenance & Community

  • Primary contributors: Penghao Wu, Saining Xie.
  • The project is based on LLaVA and LISA.

Licensing & Compatibility

  • MIT License. Permissive for commercial use and closed-source linking.

Limitations & Caveats

The project requires significant dataset preparation and management, including downloading and organizing large image datasets (COCO, GQA) and specific subsets for training. Training involves multiple stages and potentially substantial computational resources.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1
Star History
8 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Omar Sanseviero (DevRel at Google DeepMind).

gill by kohjingyu

0% · 463 stars
Multimodal LLM for generating/retrieving images and generating text
Created 2 years ago · Updated 1 year ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Elvis Saravia (Founder of DAIR.AI).

DeepSeek-VL2 by deepseek-ai

0.1% · 5k stars
MoE vision-language model for multimodal understanding
Created 9 months ago · Updated 6 months ago