LLaVA-Plus-Codebase by LLaVA-VL

Multimodal agent for vision tasks using external tools

Created 1 year ago
758 stars

Top 45.8% on SourcePulse

View on GitHub
Project Summary

LLaVA-Plus enables large language and vision assistants to leverage external tools for enhanced multimodal reasoning and task execution. It targets researchers and developers building AI agents capable of interacting with and manipulating their environment through tool use. The primary benefit is augmenting these assistants with functional capabilities beyond their inherent knowledge.

How It Works

LLaVA-Plus builds upon the LLaVA architecture, adding a mechanism for the assistant to identify, select, and invoke external tools. This is achieved through specialized training data that teaches the model to generate tool calls as part of its output. At inference time, the system orchestrates these calls by launching separate "tool workers" that execute specific functionalities, allowing the model to delegate tasks such as image segmentation or object recognition to specialized modules, as sketched below.
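The snippet below is a minimal, hypothetical sketch of that generate, dispatch, then generate-again pattern. It is not the repository's actual API: names such as parse_tool_call, TOOL_WORKERS, and run_llava_plus are invented for illustration, and in the real system the model worker and each tool worker run as separate networked processes coordinated by a controller rather than in-process callables.

```python
# Hypothetical sketch of the tool-call orchestration pattern described above.
# All names here are illustrative and do not match the repository's actual API.
import json
import re
from typing import Callable, Dict, Optional


def parse_tool_call(model_output: str) -> Optional[dict]:
    """Extract a JSON tool call like {"api_name": ..., "arguments": ...} from model text."""
    match = re.search(r"\{.*\}", model_output, re.DOTALL)
    if not match:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None


# Each "tool worker" is a separate service in LLaVA-Plus; here they are plain callables.
TOOL_WORKERS: Dict[str, Callable[[dict], dict]] = {
    "grounding_dino": lambda args: {"boxes": "..."},    # object detection / grounding
    "segment_anything": lambda args: {"masks": "..."},  # image segmentation
}


def run_llava_plus(generate: Callable[[str], str], user_query: str) -> str:
    """One reasoning turn: generate, dispatch any tool call, then generate again with the result."""
    first_pass = generate(user_query)
    call = parse_tool_call(first_pass)
    if call is None:
        return first_pass  # no tool needed; answer directly
    worker = TOOL_WORKERS.get(call.get("api_name", ""))
    if worker is None:
        return first_pass
    tool_result = worker(call.get("arguments", {}))
    # Feed the tool output back to the model so it can compose the final answer.
    return generate(f"{user_query}\nTool result: {json.dumps(tool_result)}")
```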

Quick Start & Requirements

  • Install: Clone the repository, create a conda environment (conda create -n llava python=3.10), activate it, and install the package (pip install -e .). Additional training packages can be installed with pip install -e ".[train]".
  • Prerequisites: Python 3.10, CUDA (for GPU acceleration), FlashAttention. macOS and Windows users are directed to the base LLaVA instructions.
  • Demo: Requires launching a controller, a model worker, tool workers, and a Gradio web server (a hedged launch sketch follows this list).
  • Links: Project Page, arXiv, Demo, Data, Model Zoo.
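For the demo, the sketch below shows one way those processes could be started together. It is an assumption-laden illustration: the module paths and flags mirror the base LLaVA serving layout (llava.serve.controller, llava.serve.model_worker, llava.serve.gradio_web_server), the model path is a placeholder, and LLaVA-Plus's tool workers are omitted because their entry points are repo-specific; the repository README remains the authoritative reference.

```python
# Hedged sketch: launches the demo services as subprocesses, assuming the base
# LLaVA serving layout. Tool workers are omitted (their entry points vary by repo).
import subprocess
import sys
import time

CONTROLLER_URL = "http://localhost:10000"
MODEL_PATH = "<path-or-hub-id-of-the-llava-plus-checkpoint>"  # placeholder

services = [
    # Central controller that registers workers and routes requests.
    [sys.executable, "-m", "llava.serve.controller",
     "--host", "0.0.0.0", "--port", "10000"],
    # Model worker that hosts the LLaVA-Plus checkpoint on the available GPU(s).
    [sys.executable, "-m", "llava.serve.model_worker",
     "--host", "0.0.0.0", "--controller", CONTROLLER_URL,
     "--port", "40000", "--worker", "http://localhost:40000",
     "--model-path", MODEL_PATH],
    # Gradio web server that provides the chat UI.
    [sys.executable, "-m", "llava.serve.gradio_web_server",
     "--controller", CONTROLLER_URL, "--model-list-mode", "reload"],
]

procs = []
for cmd in services:
    procs.append(subprocess.Popen(cmd))
    time.sleep(10)  # give each service time to come up before starting the next

for proc in procs:
    proc.wait()
```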

Highlighted Details

  • Enables vision-language assistants to use external tools for general vision tasks.
  • Supports multiple GPUs for inference, with automatic utilization if available.
  • Integrates with various vision tools including Grounding DINO, Segment Anything, and SEEM.
  • Training requires significant resources (e.g., 4 or 8 A100 GPUs with 80 GB memory).

Maintenance & Community

The project builds on the LLaVA and Vicuna projects. The README does not explicitly provide further community engagement details.

Licensing & Compatibility

The data and checkpoints are licensed for research use only and are additionally restricted by the licenses of LLaVA, LLaMA, Vicuna, and GPT-4. The dataset is released under CC BY-NC 4.0, which prohibits commercial use. Models trained with this dataset are likewise restricted to research purposes.

Limitations & Caveats

Parts of the codebase are still under preparation and active update. The licensing explicitly restricts commercial use and deployment outside of research contexts, and training requires substantial GPU resources.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 3 stars in the last 30 days

Explore Similar Projects

Starred by Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), Douwe Kiela (Cofounder of Contextual AI), and 1 more.

lens by ContextualAI

Top 0.3% on SourcePulse · 353 stars
Vision-language research paper using LLMs
Created 2 years ago · Updated 1 month ago
Starred by Andrew Ng (Founder of DeepLearning.AI; Cofounder of Coursera; Professor at Stanford), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 2 more.

vision-agent by landing-ai

Top 0.1% on SourcePulse · 5k stars
Visual AI agent for generating runnable vision code from image/video prompts
Created 1 year ago · Updated 2 weeks ago
Starred by Pawel Garbacki (Cofounder of Fireworks AI), Forrest Iandola (Author of SqueezeNet; Research Scientist at Meta), and 17 more.

MiniGPT-4 by Vision-CAIR

Top 0.0% on SourcePulse · 26k stars
Vision-language model for multi-task learning
Created 2 years ago · Updated 1 year ago