visprog by allenai

Neuro-symbolic system for compositional visual reasoning using natural language

Created 2 years ago

759 stars

Top 46.0% on SourcePulse

Project Summary

This repository provides the official code for VisProg, a neuro-symbolic system designed for compositional visual reasoning based on natural language instructions. It targets researchers and developers working on complex visual question answering and image manipulation tasks, offering an interpretable and extensible framework.

How It Works

VisProg uses GPT-3's in-context learning ability to generate Python-like modular programs, which are then executed step by step by invoking off-the-shelf computer vision models and image processing routines. Because the programs are produced from in-context examples, no task-specific training is required, and each run yields both a solution and an interpretable rationale built from the intermediate outputs. The system is modular, so new functionality and tasks can be added by registering additional modules.
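To make the execution step concrete, here is a minimal sketch of an interpreter for programs of this kind. It is an illustration under stated assumptions, not the repository's implementation: the modules `loc`, `count`, and `result` are hypothetical stand-ins that return fixed values instead of calling vision models, and the `OUT=MODULE(key=value,...)` step syntax is assumed for the example.

```python
import ast
import re

# Hypothetical stand-in modules; real modules would wrap vision models.
def loc(image, object):
    """Pretend detector: returns one fixed box for any object name."""
    return [(0, 0, 10, 10)]

def count(box):
    """Count the boxes produced by an earlier step."""
    return len(box)

def result(var):
    """Pass the final value through unchanged."""
    return var

MODULES = {"LOC": loc, "COUNT": count, "RESULT": result}

# Assumed step syntax: OUT=MODULE(key=value,...)
STEP_RE = re.compile(r"^(\w+)=(\w+)\((.*)\)$")

def parse_step(line):
    """Split one step into (output name, module name, keyword-argument nodes)."""
    match = STEP_RE.match(line.strip())
    if match is None:
        raise ValueError(f"cannot parse step: {line!r}")
    out_name, module_name, arg_src = match.groups()
    call = ast.parse(f"f({arg_src})", mode="eval").body
    return out_name, module_name, call.keywords

def resolve(node, state):
    """Names refer to earlier outputs in state; anything else must be a literal."""
    if isinstance(node, ast.Name):
        return state[node.id]
    return ast.literal_eval(node)

def run_program(program, state):
    """Execute steps in order and record a trace of intermediate results."""
    trace = []
    for line in program.strip().splitlines():
        out_name, module_name, keywords = parse_step(line)
        kwargs = {kw.arg: resolve(kw.value, state) for kw in keywords}
        state[out_name] = MODULES[module_name](**kwargs)
        trace.append((out_name, module_name, state[out_name]))
    return state, trace

program = """
BOX0=LOC(image=IMAGE,object='cat')
ANSWER0=COUNT(box=BOX0)
FINAL_RESULT=RESULT(var=ANSWER0)
"""
state, trace = run_program(program, {"IMAGE": "placeholder-image"})
```

The `trace` list plays the role of the rationale: it records which module produced each intermediate value, in execution order.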

Quick Start & Requirements

Install dependencies using conda env create -f environment.yaml and activate with conda activate visprog.
Run provided Jupyter notebooks (e.g., notebooks/ok_det.ipynb, notebooks/image_editing.ipynb, notebooks/nlvr.ipynb, notebooks/gqa.ipynb).
Requires an OpenAI API key.
Official project page: https://visprog.github.io/
arXiv paper: https://arxiv.org/abs/2211.11559

Highlighted Details

CVPR 2023 Best Paper award winner.
Neuro-symbolic approach for compositional visual reasoning.
Generates Python-like modular programs for execution, providing interpretable rationales.
Modular design allows easy addition of new modules and tasks.
Swappable vision modules (e.g., BLIP for VQA).
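One way to picture swappable modules is a name-to-callable registry, sketched below. This is an assumption-laden illustration rather than the repository's design: `ModuleRegistry`, `vqa_backend_a`, and `vqa_backend_b` are hypothetical names, and the backends return canned strings where a real setup would call a model such as BLIP.

```python
from typing import Callable, Dict

# Hypothetical backends; in practice each would wrap a real VQA model.
def vqa_backend_a(image: str, question: str) -> str:
    return f"backend-a answer to {question!r}"

def vqa_backend_b(image: str, question: str) -> str:
    return f"backend-b answer to {question!r}"

class ModuleRegistry:
    """Maps module names used in programs to the callables that implement them."""

    def __init__(self) -> None:
        self._modules: Dict[str, Callable[..., object]] = {}

    def register(self, name: str, fn: Callable[..., object]) -> None:
        # Registering an existing name replaces its implementation (a swap).
        self._modules[name] = fn

    def call(self, name: str, **kwargs: object) -> object:
        if name not in self._modules:
            raise KeyError(f"no module registered under {name!r}")
        return self._modules[name](**kwargs)

registry = ModuleRegistry()
registry.register("VQA", vqa_backend_a)
first = registry.call("VQA", image="img.png", question="What color is the car?")

# Swap the implementation; programs that call VQA need no changes.
registry.register("VQA", vqa_backend_b)
second = registry.call("VQA", image="img.png", question="What color is the car?")
```

Because programs refer to modules only by name, replacing the callable behind a name changes the model without touching any generated program.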

Maintenance & Community

The project is associated with the Allen Institute for AI (AI2).
The README mentions a successor project, CodeNav, which addresses VisProg's limitations.

Licensing & Compatibility

The repository does not explicitly state a license in the provided README.

Limitations & Caveats

Performance depends on GPT-3's program generation ability; generation may fail on instructions that differ significantly from the in-context examples.
Tasks not solvable by the current set of modules require manual addition of new modules.
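The second limitation can be caught before execution by checking a generated program against the set of available modules. The sketch below is a hedged illustration: `missing_modules` is a hypothetical helper, `KNOWN_MODULES` is an illustrative subset rather than the repository's full module list, the `DEPTH` module is invented for the example, and the `OUT=MODULE(...)` step syntax is assumed.

```python
import re

# Illustrative subset of module names; not the repository's full list.
KNOWN_MODULES = {"LOC", "CROP", "VQA", "EVAL", "RESULT"}

def missing_modules(program: str, known: set) -> list:
    """Return module names used in the program but absent from `known`, in first-use order."""
    missing = []
    for line in program.strip().splitlines():
        match = re.match(r"^\s*\w+\s*=\s*(\w+)\(", line)
        if match is None:
            continue  # skip lines that are not OUT=MODULE(...) steps
        name = match.group(1)
        if name not in known and name not in missing:
            missing.append(name)
    return missing

# DEPTH is a made-up module used here to trigger the check.
program = """
BOX0=LOC(image=IMAGE,object='dog')
IMAGE0=CROP(image=IMAGE,box=BOX0)
DEPTH0=DEPTH(image=IMAGE0)
FINAL_RESULT=RESULT(var=DEPTH0)
"""
unsupported = missing_modules(program, KNOWN_MODULES)
```

A non-empty result signals that the task needs a module that has not been implemented yet, which is the point at which a new module would have to be added by hand.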

Health Check

Last Commit: 1 year ago

Responsiveness: Inactive

Pull Requests (30d): 0

Issues (30d): 0

Star History

4 stars in the last 30 days

Explore Similar Projects

SmartEdit by TencentARC

Research paper for complex instruction-based image editing using multimodal LLMs

Created 2 years ago

Updated 1 year ago

OmniGen2 by VectorSpaceLab

Multimodal generation for text and images

Created 7 months ago

Updated 1 month ago

Starred by Georgios Konstantopoulos (CTO, General Partner at Paradigm) and Will Brown (Research Lead at Prime Intellect).

Phi-3-Vision-MLX by JosefAlbers

Apple Silicon framework for language and vision models

Created 1 year ago

Updated 2 months ago

Starred by Bryan Helmig (Cofounder of Zapier), Jeremy Howard (Cofounder of fast.ai), and 1 more.

funcchain by shroominic

SDK for building cognitive systems with Python

Created 2 years ago

Updated 1 year ago

Thyme by yfzhang114

Multimodal reasoning and code execution for complex visual tasks

Created 4 months ago

Updated 3 months ago

Vitron by SkyworkAI

Vision LLM research paper for pixel-level understanding, generation, segmentation & editing

Created 1 year ago

Updated 1 year ago

Starred by Jason Huggins (Creator of Selenium).

BakLLaVA by SkunkworksAI

Multimodal model for visual instruction tuning, enhanced from LLaVA

Created 2 years ago

Updated 1 year ago

Starred by Haotian Liu (Author of LLaVA; Research Scientist at xAI) and Pawel Garbacki (Cofounder of Fireworks AI).

LLaVA-Plus-Codebase by LLaVA-VL

Multimodal agent for vision tasks using external tools

Created 2 years ago

Updated 1 year ago

PandaGPT by yxuansu

Multimodal model for instruction following across six modalities

Created 2 years ago

Updated 2 years ago

Starred by Andreas Jansson (Cofounder of Replicate), Gabriel Almeida (Cofounder of Langflow), and 3 more.

viper by cvlab-columbia

ViperGPT: Visual inference via Python execution

Created 2 years ago

Updated 1 year ago

Starred by Andrew Ng (Founder of DeepLearning.AI; Cofounder of Coursera; Professor at Stanford), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 2 more.

vision-agent by landing-ai

Visual AI agent for generating runnable vision code from image/video prompts

Created 1 year ago

Updated 1 month ago

minimind-v by jingyaogong

VLM for training vision-language models from scratch

Created 1 year ago

Updated 2 weeks ago
