GPT4Tools by AILab-CVC

Intelligent system for visual foundation model control via LLM

created 2 years ago
774 stars

Top 46.0% on sourcepulse

Project Summary

GPT4Tools is an intelligent system designed to enable conversational interaction with images by automatically selecting, controlling, and utilizing various visual foundation models. It targets users who need to perform image-related tasks within a conversational context, offering a unified interface for diverse visual operations.

How It Works

GPT4Tools uses a Vicuna-based Large Language Model (LLM) fine-tuned on a self-built dataset of 71K instruction examples. The LLM analyzes the conversation to decide which visual foundation model (tool) to invoke and how to control it. Fine-tuning on this self-instruct data teaches the LLM to use a suite of 22 integrated visual tools, so images can be edited and analyzed within the conversation.
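
The select-then-invoke loop described above can be sketched as follows. This is a minimal illustration, not the project's code: the `Action:` / `Action Input:` reply format, the tool names, and the stub functions are assumptions chosen for the example. In the real system the reply would come from the fine-tuned LLM and each tool would wrap a visual foundation model.

```python
import re
from typing import Callable, Dict, Optional, Tuple

# Hypothetical stand-ins for visual foundation models; each takes a text
# argument (for example an image path) and returns a text observation.
def caption_image(image_path: str) -> str:
    return f"caption for {image_path}"

def segment_image(image_path: str) -> str:
    return f"segmentation mask saved for {image_path}"

# Assumed tool names; the real project defines its own set of 22 tools.
TOOLS: Dict[str, Callable[[str], str]] = {
    "Image Captioning": caption_image,
    "Segmenting": segment_image,
}

# Assumed reply format: "Action: <tool name>" then "Action Input: <argument>".
ACTION_RE = re.compile(r"Action:\s*(?P<tool>.+?)\s*\nAction Input:\s*(?P<arg>.+)")

def parse_action(llm_reply: str) -> Optional[Tuple[str, str]]:
    """Return (tool_name, tool_input), or None if no tool call is present."""
    match = ACTION_RE.search(llm_reply)
    if match is None:
        return None
    return match.group("tool").strip(), match.group("arg").strip()

def dispatch(llm_reply: str) -> str:
    """Run the tool the reply asks for, or pass the reply through unchanged."""
    parsed = parse_action(llm_reply)
    if parsed is None:
        return llm_reply
    tool_name, tool_input = parsed
    if tool_name not in TOOLS:
        return f"unknown tool: {tool_name}"
    return TOOLS[tool_name](tool_input)
```

A reply with no tool call is treated as the final answer, which keeps ordinary conversation turns on the same code path as tool-using ones.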

Quick Start & Requirements

  • Install: Clone the repository and install dependencies via pip install -r requirements.txt.
  • Prerequisites: Requires downloading Vicuna base models (e.g., lmsys/vicuna-13b-v1.5) and GPT4Tools LoRA weights. Additional visual model weights (e.g., Stable Diffusion, BLIP, ControlNet) may need to be downloaded.
  • Resources: The demo script suggests configurations for 1 or 4 GPUs, with specific tool assignments to CUDA devices. Fine-tuning requires DeepSpeed and significant computational resources.
  • Links: Project Page, Online Demo, Dataset.
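
The per-tool GPU assignment mentioned under Resources can be pictured as a mapping from tool names to devices. The sketch below parses a comma-separated spec of the form `ToolName_device`. That format, the tool names, and the helper names are assumptions for illustration, so check the repository's demo script for the exact option syntax.

```python
from typing import Dict, List

def parse_load_spec(spec: str) -> Dict[str, str]:
    """Parse 'ToolA_cuda:0,ToolB_cpu' into {'ToolA': 'cuda:0', 'ToolB': 'cpu'}."""
    assignments: Dict[str, str] = {}
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        # Split on the last underscore so tool names may contain underscores.
        name, sep, device = item.rpartition("_")
        if not sep or not name or not device:
            raise ValueError(f"expected 'ToolName_device', got {item!r}")
        assignments[name] = device
    return assignments

def tools_per_device(assignments: Dict[str, str]) -> Dict[str, List[str]]:
    """Group tool names by device to see how load is spread across GPUs."""
    grouped: Dict[str, List[str]] = {}
    for name, device in assignments.items():
        grouped.setdefault(device, []).append(name)
    return grouped
```

Grouping by device makes it easy to see whether a single-GPU or a four-GPU layout places too many tools on one card.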

Highlighted Details

  • Supports 22 integrated visual tools, including image captioning, VQA, segmentation, inpainting, and ControlNet variations.
  • Offers a flexible and extensible architecture allowing users to add new tools or replace existing LLMs.
  • Provides a self-instruct dataset of 71K examples for fine-tuning and model adaptation via LoRA.
  • Paper accepted at NeurIPS 2023.
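
One way to picture the extensibility claim above is a small tool registry in which each tool carries a name and a description the LLM can read. The sketch below is an assumption about how such a design could look, not the project's actual interface: `Tool`, `register_tool`, `describe_tools`, and the example tool are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass(frozen=True)
class Tool:
    name: str
    description: str
    run: Callable[[str], str]

# Hypothetical global registry mapping tool names to Tool records.
REGISTRY: Dict[str, Tool] = {}

def register_tool(name: str, description: str):
    """Decorator that adds a function to the registry under a unique name."""
    def decorator(func: Callable[[str], str]) -> Callable[[str], str]:
        if name in REGISTRY:
            raise ValueError(f"tool already registered: {name}")
        REGISTRY[name] = Tool(name, description, func)
        return func
    return decorator

@register_tool("Edge Detection", "Input: an image path. Output: path to an edge map.")
def detect_edges(image_path: str) -> str:
    # Placeholder body; a real tool would call a vision model here.
    return image_path.replace(".png", "_edges.png")

def describe_tools() -> str:
    """Build the tool list that would be placed in the LLM prompt."""
    return "\n".join(f"> {t.name}: {t.description}" for t in REGISTRY.values())
```

With this layout, adding a tool means writing one decorated function; the prompt text is regenerated from the registry rather than edited by hand.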

Maintenance & Community

The most recent releases added support for Vicuna-v1.5 and new demos. Key contributors are listed as authors of the associated paper.

Licensing & Compatibility

The project releases LoRA weights to comply with the LLaMA model license. Compatibility with commercial or closed-source applications would depend on the underlying LLaMA and Vicuna licenses.

Limitations & Caveats

The system relies on specific versions of Vicuna and requires careful management of model and tool weight downloads. The multi-GPU setup advice indicates a significant hardware requirement for optimal performance.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 4 stars in the last 90 days
