VisualSketchpad by Yushi-Hu

Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal LLMs

Created 1 year ago
258 stars

Top 98.2% on SourcePulse

Project Summary

This repository provides the code for Visual Sketchpad, a system that uses sketching as a visual chain of thought for multimodal language models. It is designed for researchers and developers working with multimodal AI, enabling more intuitive and structured reasoning for complex tasks.

How It Works

Visual Sketchpad leverages a modular agent-based architecture, built upon pyautogen. It orchestrates multiple "expert" agents, including language models and specialized vision models (like SOM, GroundingDINO, and Depth-Anything), to collaboratively solve tasks. The system processes inputs, generates intermediate visual "sketches" or representations, and uses these as a chain of thought to arrive at a final answer, mimicking a human's visual reasoning process.
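As a rough illustration of this loop, a library-free sketch is shown below. The expert stubs and the plan_next_action planner are simplified stand-ins of my own: in the actual repository the experts are separate Gradio-served vision models (SOM, GroundingDINO, Depth-Anything) and the planner is a multimodal LLM driven through pyautogen.

```python
# Minimal, library-free sketch of the Visual Sketchpad loop.
# Stub "experts" stand in for the real vision models; the stub planner
# stands in for the multimodal LLM that decides what to sketch next.

def som_expert(image):
    """Stub: would return a set-of-marks annotated image."""
    return f"som({image})"

def depth_expert(image):
    """Stub: would return a depth-map rendering of the image."""
    return f"depth({image})"

EXPERTS = {"som": som_expert, "depth": depth_expert}

def plan_next_action(question, sketches):
    """Stub planner: a real system asks the LLM which sketch to draw next,
    or whether the accumulated sketches already support an answer."""
    if not sketches:
        return ("sketch", "som")
    if len(sketches) == 1:
        return ("sketch", "depth")
    return ("answer", f"answer based on {len(sketches)} sketches")

def solve(question, image):
    """Iteratively add visual sketches until the planner commits to an answer."""
    sketches = []
    while True:
        kind, payload = plan_next_action(question, sketches)
        if kind == "answer":
            return payload
        # Each expert's output is appended as a new intermediate "thought".
        sketches.append(EXPERTS[payload](image))

print(solve("How far away is the chair?", "img.png"))
```

The key design point this mimics is that the intermediate artifacts are images, not text: each expert call produces a new visual state that is fed back to the planner as part of its chain of thought.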

Quick Start & Requirements

  • Installation: create the environment with conda create -n sketchpad python=3.9, then install the core dependency with pip install pyautogen==0.2.26, and finally install the remaining packages with pip install 'pyautogen[jupyter-executor]' Pillow joblib matplotlib opencv-python numpy gradio gradio_client networkx scipy datasets.
  • Prerequisites: OpenAI API key, Python 3.9. For vision tasks, separate Gradio servers for vision experts (SOM, GroundingDINO, Depth-Anything) must be installed and their web links configured.
  • Data: Preprocessed task data is available via a Google Drive link in the README.
  • Resources: Requires significant setup for vision experts.
  • Links: 🌐 Homepage | 📖 arXiv | 📑 Paper
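Since the vision experts' Gradio web links must be configured by hand, a small sanity check on that configuration can catch dead or malformed links early. The registry below is hypothetical (the EXPERT_LINKS name and localhost URLs are illustrative, not the repository's actual config format):

```python
from urllib.parse import urlparse

# Hypothetical registry of vision-expert Gradio servers; replace each URL
# with the link printed by your own running server.
EXPERT_LINKS = {
    "som": "http://localhost:7860",
    "grounding_dino": "http://localhost:7861",
    "depth_anything": "http://localhost:7862",
}

def validate_links(links):
    """Return the expert names whose configured link is not a usable http(s) URL."""
    bad = []
    for name, url in links.items():
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            bad.append(name)
    return bad

assert validate_links(EXPERT_LINKS) == []
```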

Highlighted Details

  • Implements a "visual chain of thought" for multimodal reasoning.
  • Modular design allows integration of various vision experts via Gradio servers.
  • Supports both math/geometry and computer vision tasks.
  • Includes example task execution scripts and agent trajectory visualizations.
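For the math/geometry tasks, a "sketch" is a program-generated drawing. A toy illustration using Pillow (one of the listed dependencies) follows; the auxiliary-line example itself is mine, not taken from the repository:

```python
from PIL import Image, ImageDraw

# Toy "sketch" step: draw an auxiliary line on a blank canvas, the way the
# agent might annotate a geometry figure before reasoning over it.
img = Image.new("RGB", (100, 100), "white")
draw = ImageDraw.Draw(img)
draw.line([(10, 90), (90, 10)], fill="red", width=2)

# The annotated image, not a textual description of it, becomes the next
# intermediate "thought" shown back to the multimodal model.
assert img.getpixel((50, 50)) != (255, 255, 255)
```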

Maintenance & Community

The project was accepted to NeurIPS 2024. Recent updates address environment robustness and potential bugs. Community interaction channels are not explicitly mentioned in the README.

Licensing & Compatibility

The repository does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

Setting up the vision experts requires separate installations and configuration of Gradio servers, which can be complex. The system relies on external API keys (OpenAI) and potentially large datasets.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 1
  • Star history: 7 stars in the last 30 days
