LLaVA-Plus-Codebase by LLaVA-VL

Multimodal agent for vision tasks using external tools

Created 1 year ago
758 stars

Top 45.8% on SourcePulse

View on GitHub
Project Summary

LLaVA-Plus enables large language and vision assistants to leverage external tools for enhanced multimodal reasoning and task execution. It targets researchers and developers building AI agents capable of interacting with and manipulating their environment through tool use. The primary benefit is augmenting these assistants with functional capabilities beyond their inherent knowledge.

How It Works

LLaVA-Plus builds upon the LLaVA architecture, adding a mechanism for the assistant to identify, select, and invoke external tools. This is achieved through specialized training data that teaches the model to generate tool calls as part of its output. At inference time, the system orchestrates these calls by launching separate "tool workers" that execute specific functionalities, allowing the model to delegate tasks such as image segmentation or object recognition to specialized modules, as sketched below.
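The snippet below is a minimal, hypothetical sketch of that generate, dispatch, then generate-again pattern. It is not the repository's actual API: names such as parse_tool_call, TOOL_WORKERS, and run_llava_plus are invented for illustration, and in the real system the model worker and each tool worker run as separate networked processes coordinated by a controller rather than in-process callables.

```python
# Hypothetical sketch of the tool-call orchestration pattern described above.
# All names here are illustrative and do not match the repository's actual API.
import json
import re
from typing import Callable, Dict, Optional


def parse_tool_call(model_output: str) -> Optional[dict]:
    """Extract a JSON tool call like {"api_name": ..., "arguments": ...} from model text."""
    match = re.search(r"\{.*\}", model_output, re.DOTALL)
    if not match:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None


# Each "tool worker" is a separate service in LLaVA-Plus; here they are plain callables.
TOOL_WORKERS: Dict[str, Callable[[dict], dict]] = {
    "grounding_dino": lambda args: {"boxes": "..."},    # object detection / grounding
    "segment_anything": lambda args: {"masks": "..."},  # image segmentation
}


def run_llava_plus(generate: Callable[[str], str], user_query: str) -> str:
    """One reasoning turn: generate, dispatch any tool call, then generate again with the result."""
    first_pass = generate(user_query)
    call = parse_tool_call(first_pass)
    if call is None:
        return first_pass  # no tool needed; answer directly
    worker = TOOL_WORKERS.get(call.get("api_name", ""))
    if worker is None:
        return first_pass
    tool_result = worker(call.get("arguments", {}))
    # Feed the tool output back to the model so it can compose the final answer.
    return generate(f"{user_query}\nTool result: {json.dumps(tool_result)}")
```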

Quick Start & Requirements

  • Install: Clone the repository, create a conda environment (conda create -n llava python=3.10), activate it, and install the package (pip install -e .). Additional training packages can be installed with pip install -e ".[train]".
  • Prerequisites: Python 3.10, CUDA (for GPU acceleration), FlashAttention. macOS and Windows users are directed to the base LLaVA instructions.
  • Demo: Requires launching a controller, a model worker, tool workers, and a Gradio web server (a hedged launch sketch follows this list).
  • Links: Project Page, arXiv, Demo, Data, Model Zoo.
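For the demo, the sketch below shows one way those processes could be started together. It is an assumption-laden illustration: the module paths and flags mirror the base LLaVA serving layout (llava.serve.controller, llava.serve.model_worker, llava.serve.gradio_web_server), the model path is a placeholder, and LLaVA-Plus's tool workers are omitted because their entry points are repo-specific; the repository README remains the authoritative reference.

```python
# Hedged sketch: launches the demo services as subprocesses, assuming the base
# LLaVA serving layout. Tool workers are omitted (their entry points vary by repo).
import subprocess
import sys
import time

CONTROLLER_URL = "http://localhost:10000"
MODEL_PATH = "<path-or-hub-id-of-the-llava-plus-checkpoint>"  # placeholder

services = [
    # Central controller that registers workers and routes requests.
    [sys.executable, "-m", "llava.serve.controller",
     "--host", "0.0.0.0", "--port", "10000"],
    # Model worker that hosts the LLaVA-Plus checkpoint on the available GPU(s).
    [sys.executable, "-m", "llava.serve.model_worker",
     "--host", "0.0.0.0", "--controller", CONTROLLER_URL,
     "--port", "40000", "--worker", "http://localhost:40000",
     "--model-path", MODEL_PATH],
    # Gradio web server that provides the chat UI.
    [sys.executable, "-m", "llava.serve.gradio_web_server",
     "--controller", CONTROLLER_URL, "--model-list-mode", "reload"],
]

procs = []
for cmd in services:
    procs.append(subprocess.Popen(cmd))
    time.sleep(10)  # give each service time to come up before starting the next

for proc in procs:
    proc.wait()
```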

Highlighted Details

  • Enables vision-language assistants to use external tools for general vision tasks.
  • Supports multiple GPUs for inference, with automatic utilization if available.
  • Integrates with various vision tools including Grounding DINO, Segment Anything, and SEEM.
  • Training requires significant resources (e.g., 4 or 8 A100 GPUs with 80 GB memory).

Maintenance & Community

The project builds on the LLaVA and Vicuna projects. The README does not explicitly provide further community engagement details.

Licensing & Compatibility

The data and checkpoints are licensed for research use only and are additionally restricted by the licenses of LLaVA, LLaMA, Vicuna, and GPT-4. The dataset is released under CC BY-NC 4.0, which prohibits commercial use. Models trained with this dataset are likewise restricted to research purposes.

Limitations & Caveats

Parts of the codebase are still under preparation and active update. The licensing explicitly restricts commercial use and deployment outside of research contexts, and training requires substantial GPU resources.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 3 stars in the last 30 days

Explore Similar Projects

Starred by Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), Douwe Kiela (Cofounder of Contextual AI), and 1 more.

lens by ContextualAI

Top 0.3% on SourcePulse · 353 stars
Vision-language research paper using LLMs
Created 2 years ago · Updated 1 month ago
Starred by Andrew Ng (Founder of DeepLearning.AI; Cofounder of Coursera; Professor at Stanford), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 2 more.

vision-agent by landing-ai

Top 0.1% on SourcePulse · 5k stars
Visual AI agent for generating runnable vision code from image/video prompts
Created 1 year ago · Updated 2 weeks ago
Starred by Pawel Garbacki (Cofounder of Fireworks AI), Forrest Iandola (Author of SqueezeNet; Research Scientist at Meta), and 17 more.

MiniGPT-4 by Vision-CAIR

Top 0.0% on SourcePulse · 26k stars
Vision-language model for multi-task learning
Created 2 years ago · Updated 1 year ago