viper  by cvlab-columbia

ViperGPT: Visual inference via Python execution

Created 2 years ago
1,708 stars

Top 24.9% on SourcePulse

View on GitHub
Project Summary

ViperGPT enables visual reasoning by generating and executing Python code using large language models. It's designed for researchers and developers working with multimodal AI, offering a framework to bridge visual understanding with programmatic problem-solving.

How It Works

ViperGPT uses LLMs (GPT-3.5 Turbo, GPT-4) to translate a natural-language query about an image into Python code. The generated code calls vision models (e.g., GLIP, BLIP-2) to perform tasks like object detection, segmentation, and captioning, and can be executed directly or reviewed manually, providing a flexible and powerful approach to visual inference.
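The pattern can be sketched as follows. This is a minimal illustration with the vision backends mocked out, not ViperGPT's actual API; the `ImagePatch` class and method names here are simplified stand-ins for the abstraction the project exposes to the LLM.

```python
# Sketch of the ViperGPT pattern: an LLM writes a short Python program
# against a small vision API; executing the program answers the query.
# The vision calls are mocked here; in the real system they dispatch to
# models such as GLIP (detection) and BLIP-2 (VQA/captioning).

class ImagePatch:
    """Minimal stand-in for ViperGPT's image-patch abstraction."""
    def __init__(self, image, detections=None, answers=None):
        self.image = image
        self._detections = detections or {}  # object name -> list of patches
        self._answers = answers or {}        # question -> answer string

    def find(self, object_name):
        """Mocked object detection (GLIP in the real system)."""
        return self._detections.get(object_name, [])

    def simple_query(self, question):
        """Mocked visual question answering (BLIP-2 in the real system)."""
        return self._answers.get(question, "unknown")


def execute_generated_code(code, image_patch):
    """Run LLM-generated code in a namespace exposing the vision API.
    A real deployment should sandbox this step (see Limitations)."""
    namespace = {"ImagePatch": ImagePatch}
    exec(code, namespace)
    return namespace["execute_command"](image_patch)


# Code the LLM might generate for "How many mugs are in the image?"
generated = """
def execute_command(image):
    mugs = image.find("mug")
    return len(mugs)
"""

patch = ImagePatch("photo.jpg", detections={"mug": ["m1", "m2"]})
print(execute_generated_code(generated, patch))  # 2
```

The key design point is that the LLM never sees pixels: it only reasons over the query and the vision API, while perception is delegated to specialized models at execution time.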

Quick Start & Requirements

  • Install by cloning with git clone --recurse-submodules and running bash setup.sh in the cloned directory.
  • Requires CUDA, Python, and a Conda environment (setup_env.sh).
  • Manual download of two pretrained models is necessary; others download automatically.
  • An OpenAI API key is required, placed in api.key.
  • Official docs: https://github.com/cvlab-columbia/viper
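The steps above roughly correspond to the following commands; the exact order of the setup scripts is an assumption, so check the README before running:

```shell
# Clone with submodules (vision model dependencies live in submodules)
git clone --recurse-submodules https://github.com/cvlab-columbia/viper.git
cd viper

# Create the Conda environment, then run setup (assumed order; see README)
bash setup_env.sh
bash setup.sh   # most pretrained models download automatically;
                # two must be downloaded manually per the README

# Place your OpenAI API key in api.key (placeholder shown)
echo "sk-..." > api.key
```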

Highlighted Details

  • Supports GPT-3.5 Turbo and GPT-4, with notes on potential differences from the discontinued Codex API.
  • Offers a multiprocessing architecture for parallel model and sample execution.
  • Includes a flexible configuration system via YAML files.
  • Provides main_simple.ipynb for interactive exploration and main_batch.py for dataset processing.
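As an illustration of YAML-driven configuration, a run might be customized with an override file like the one below. The key names here are hypothetical, not the project's actual schema; consult the YAML files shipped with the repository for the real option names:

```yaml
# Hypothetical override file -- key names are illustrative only;
# see the repository's YAML configs for the actual schema.
llm:
  model: gpt-4        # or gpt-3.5-turbo
multiprocessing:
  enabled: true       # run models and samples in parallel
```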

Maintenance & Community

  • Developed by cvlab-columbia.
  • Citation details provided for academic use.

Licensing & Compatibility

  • No explicit license is mentioned in the README.
  • Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

ViperGPT executes LLM-generated code, posing potential security risks; users are advised to run in sandboxed environments. The project notes that GPT-3.5/GPT-4 are chat models and their behavior may differ from completion models. Pretrained models may contain biases.
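As one mitigation sketch (not part of ViperGPT), generated code can be run in a separate process with a hard timeout, so a runaway or crashing program cannot block the host process. This is not a full security boundary, since it imposes no filesystem or network restrictions; containers or VMs are needed for real isolation.

```python
# Minimal sandbox sketch (not part of ViperGPT): run untrusted generated
# code in a fresh Python subprocess with a timeout. Limits runtime and
# isolates crashes, but is NOT a complete security boundary.
import subprocess
import sys

def run_untrusted(code: str, timeout: float = 5.0) -> str:
    """Execute code in a child interpreter; return its stripped stdout."""
    proc = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, timeout=timeout,
    )
    return proc.stdout.strip()

print(run_untrusted("print(1 + 1)"))  # 2
```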

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 4 stars in the last 30 days

Explore Similar Projects

Starred by Andrew Ng (Founder of DeepLearning.AI; Cofounder of Coursera; Professor at Stanford), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 2 more.

vision-agent by landing-ai

Visual AI agent for generating runnable vision code from image/video prompts

  • Top 0.1% on SourcePulse
  • 5k stars
  • Created 1 year ago
  • Updated 2 weeks ago