VisualSketchpad by Yushi-Hu

Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal LLMs

Created 1 year ago
258 stars

Top 98.2% on SourcePulse

Project Summary

This repository provides the code for Visual Sketchpad, a system that uses sketching as a visual chain of thought for multimodal language models. It is designed for researchers and developers working with multimodal AI, enabling more intuitive and structured reasoning for complex tasks.

How It Works

Visual Sketchpad leverages a modular agent-based architecture, built upon pyautogen. It orchestrates multiple "expert" agents, including language models and specialized vision models (like SOM, GroundingDINO, and Depth-Anything), to collaboratively solve tasks. The system processes inputs, generates intermediate visual "sketches" or representations, and uses these as a chain of thought to arrive at a final answer, mimicking a human's visual reasoning process.
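As a rough illustration of this loop, a library-free sketch is shown below. The expert stubs and the plan_next_action planner are simplified stand-ins of my own: in the actual repository the experts are separate Gradio-served vision models (SOM, GroundingDINO, Depth-Anything) and the planner is a multimodal LLM driven through pyautogen.

```python
# Minimal, library-free sketch of the Visual Sketchpad loop.
# Stub "experts" stand in for the real vision models; the stub planner
# stands in for the multimodal LLM that decides what to sketch next.

def som_expert(image):
    """Stub: would return a set-of-marks annotated image."""
    return f"som({image})"

def depth_expert(image):
    """Stub: would return a depth-map rendering of the image."""
    return f"depth({image})"

EXPERTS = {"som": som_expert, "depth": depth_expert}

def plan_next_action(question, sketches):
    """Stub planner: a real system asks the LLM which sketch to draw next,
    or whether the accumulated sketches already support an answer."""
    if not sketches:
        return ("sketch", "som")
    if len(sketches) == 1:
        return ("sketch", "depth")
    return ("answer", f"answer based on {len(sketches)} sketches")

def solve(question, image):
    """Iteratively add visual sketches until the planner commits to an answer."""
    sketches = []
    while True:
        kind, payload = plan_next_action(question, sketches)
        if kind == "answer":
            return payload
        # Each expert's output is appended as a new intermediate "thought".
        sketches.append(EXPERTS[payload](image))

print(solve("How far away is the chair?", "img.png"))
```

The key design point this mimics is that the intermediate artifacts are images, not text: each expert call produces a new visual state that is fed back to the planner as part of its chain of thought.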

Quick Start & Requirements

  • Installation: create the environment with conda create -n sketchpad python=3.9, then install the core dependency with pip install pyautogen==0.2.26, and finally install the remaining packages with pip install 'pyautogen[jupyter-executor]' Pillow joblib matplotlib opencv-python numpy gradio gradio_client networkx scipy datasets.
  • Prerequisites: OpenAI API key, Python 3.9. For vision tasks, separate Gradio servers for vision experts (SOM, GroundingDINO, Depth-Anything) must be installed and their web links configured.
  • Data: Preprocessed task data is available via a Google Drive link in the README.
  • Resources: Requires significant setup for vision experts.
  • Links: 🌐 Homepage | 📖 arXiv | 📑 Paper
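Since the vision experts' Gradio web links must be configured by hand, a small sanity check on that configuration can catch dead or malformed links early. The registry below is hypothetical (the EXPERT_LINKS name and localhost URLs are illustrative, not the repository's actual config format):

```python
from urllib.parse import urlparse

# Hypothetical registry of vision-expert Gradio servers; replace each URL
# with the link printed by your own running server.
EXPERT_LINKS = {
    "som": "http://localhost:7860",
    "grounding_dino": "http://localhost:7861",
    "depth_anything": "http://localhost:7862",
}

def validate_links(links):
    """Return the expert names whose configured link is not a usable http(s) URL."""
    bad = []
    for name, url in links.items():
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            bad.append(name)
    return bad

assert validate_links(EXPERT_LINKS) == []
```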

Highlighted Details

  • Implements a "visual chain of thought" for multimodal reasoning.
  • Modular design allows integration of various vision experts via Gradio servers.
  • Supports both math/geometry and computer vision tasks.
  • Includes example task execution scripts and agent trajectory visualizations.
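For the math/geometry tasks, a "sketch" is a program-generated drawing. A toy illustration using Pillow (one of the listed dependencies) follows; the auxiliary-line example itself is mine, not taken from the repository:

```python
from PIL import Image, ImageDraw

# Toy "sketch" step: draw an auxiliary line on a blank canvas, the way the
# agent might annotate a geometry figure before reasoning over it.
img = Image.new("RGB", (100, 100), "white")
draw = ImageDraw.Draw(img)
draw.line([(10, 90), (90, 10)], fill="red", width=2)

# The annotated image, not a textual description of it, becomes the next
# intermediate "thought" shown back to the multimodal model.
assert img.getpixel((50, 50)) != (255, 255, 255)
```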

Maintenance & Community

The project was accepted to NeurIPS 2024. Recent updates address environment robustness and potential bugs. Community interaction channels are not explicitly mentioned in the README.

Licensing & Compatibility

The repository does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

Setting up the vision experts requires separate installations and configuration of Gradio servers, which can be complex. The system relies on external API keys (OpenAI) and potentially large datasets.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 1
  • Star history: 7 stars in the last 30 days
