LLaVA-Plus-Codebase by LLaVA-VL

Multimodal agent for vision tasks using external tools

created 1 year ago
751 stars

Top 47.2% on sourcepulse

View on GitHub
Project Summary

LLaVA-Plus enables Large Language and Vision Assistants to leverage external tools for enhanced multimodal reasoning and task execution. It targets researchers and developers building sophisticated AI agents capable of interacting with and manipulating their environment through tool use. The primary benefit is augmenting these assistants with functional capabilities beyond their inherent knowledge.

How It Works

LLaVA-Plus builds upon the LLaVA architecture, integrating a mechanism for the assistant to identify, select, and invoke external tools. This is achieved through specialized training data that teaches the model to generate tool calls as part of its output. The system orchestrates these calls by launching separate "tool workers" that execute specific functionalities, allowing the model to delegate tasks like image segmentation or object recognition to specialized modules.
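
At serve time this delegation runs as a set of cooperating processes. A minimal sketch is shown below, following the base LLaVA serving layout; the model path is a placeholder, and the exact entry points for the LLaVA-Plus tool workers and demo UI are assumptions, so check the repository's serve code before running.

    # central controller that routes requests between the model worker and the tool workers
    python -m llava.serve.controller --host 0.0.0.0 --port 10000

    # LLaVA-Plus model worker: serves the assistant that emits the tool calls
    python -m llava.serve.model_worker --host 0.0.0.0 \
        --controller http://localhost:10000 --port 40000 \
        --worker http://localhost:40000 \
        --model-path <llava-plus-checkpoint>   # placeholder; see the Model Zoo

    # one worker process per tool (Grounding DINO, Segment Anything, SEEM, ...);
    # each registers with the controller and executes the calls the model delegates to it
    # (the exact worker entry points live in the repository's serve code)

    # Gradio web server for the demo UI (LLaVA-Plus may ship its own variant of this entry point)
    python -m llava.serve.gradio_web_server --controller http://localhost:10000 --model-list-mode reload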

Quick Start & Requirements

  • Install: Clone the repository, create a conda environment (conda create -n llava python=3.10), activate it, and install the package (pip install -e .). Additional training packages can be installed with pip install -e ".[train]". The full command sequence is collected after this list.
  • Prerequisites: Python 3.10, CUDA (for GPU acceleration), FlashAttention. macOS and Windows users are directed to the base LLaVA instructions.
  • Demo: Requires launching a controller, model worker, tool workers, and a Gradio web server (the processes sketched under How It Works).
  • Links: Project Page, arXiv, Demo, Data, Model Zoo.
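
The install bullet above as a concrete command sequence; these mirror the README's instructions, with the FlashAttention step taken from the base LLaVA training setup:

    # create and activate the environment
    conda create -n llava python=3.10
    conda activate llava

    # editable install of the package
    pip install -e .

    # optional: extra packages for training, plus FlashAttention
    pip install -e ".[train]"
    pip install flash-attn --no-build-isolation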

Highlighted Details

  • Enables vision-language assistants to use tools for general vision tasks.
  • Supports multiple GPUs for inference, with automatic utilization if available (see the note after this list).
  • Integrates with various vision tools including Grounding DINO, Segment Anything, and SEEM.
  • Training requires significant resources (e.g., 4 or 8 A100 GPUs with 80 GB memory each).
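
On the multi-GPU point: selecting devices is done with the standard CUDA_VISIBLE_DEVICES variable rather than a project-specific flag, and the worker then uses whatever GPUs are visible, per the automatic-utilization note above. A hedged one-liner:

    # expose two GPUs to the model worker; utilization across them is automatic
    CUDA_VISIBLE_DEVICES=0,1 python -m llava.serve.model_worker \
        --controller http://localhost:10000 ...   # remaining flags as in the serving sketch above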

Maintenance & Community

LLaVA-Plus is associated with the LLaVA and Vicuna projects. Further community engagement details are not explicitly provided in the README.

Licensing & Compatibility

The data and checkpoints are licensed for research use only and are further restricted by the licenses of LLaVA, LLaMA, Vicuna, and GPT-4. The dataset is released under CC BY-NC 4.0, which prohibits commercial use. Models trained with this dataset are likewise restricted to research purposes.

Limitations & Caveats

Parts of the codebase are still being prepared and updated. The licensing explicitly restricts commercial use and deployment outside of research contexts. Training requires substantial GPU resources.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 13 stars in the last 90 days

Explore Similar Projects

Starred by Travis Fischer (Founder of Agentic), Patrick von Platen (Core Contributor to Hugging Face Transformers and Diffusers), and 9 more.

LLaVA by haotian-liu

  • Top 0.2% on sourcepulse · 23k stars
  • Multimodal assistant with GPT-4 level capabilities
  • created 2 years ago · updated 11 months ago