prophet by MILVLG

VQA framework using answer heuristics to prompt LLMs

created 2 years ago
276 stars

Top 94.7% on sourcepulse

Project Summary

This repository provides the official implementation for Prophet, a two-stage framework for knowledge-based Visual Question Answering (VQA). It targets researchers and practitioners in computer vision and NLP, enabling improved VQA performance by leveraging GPT-3 with answer heuristics derived from a trained VQA model.

How It Works

Prophet employs a two-stage approach. In stage one, a standard VQA model (MCAN) is trained on the target dataset, and two kinds of answer heuristics are extracted from it: candidate answers and answer-aware examples. In stage two, these heuristics are placed in the prompt sent to GPT-3 to guide it toward more accurate answers. The authors report that this approach outperforms prior state-of-the-art methods on the OK-VQA and A-OKVQA datasets.
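The second stage can be pictured with a minimal sketch of how a prompt might be assembled from the stage-one heuristics. This is an illustration only, not the repository's code: the Example class, the PROMPT_HEADER wording, the field names, and the build_prompt function are all hypothetical, and the exact prompt template used by Prophet may differ.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Example:
    """One VQA sample plus its answer heuristics (hypothetical structure)."""
    context: str                          # e.g. an image caption
    question: str
    candidates: List[Tuple[str, float]]   # (answer, confidence) from the stage-one model
    answer: str = ""                      # known answer for in-context examples; empty for the test input


# Assumed instruction text; the real template may be worded differently.
PROMPT_HEADER = (
    "Answer the question using the context and the answer candidates. "
    "Each candidate is followed by a confidence score in parentheses. "
    "The true answer may not be among the candidates."
)


def format_example(ex: Example) -> str:
    """Render one sample as a text block; the test input ends with a bare 'Answer:'."""
    cands = ", ".join(f"{ans} ({conf:.2f})" for ans, conf in ex.candidates)
    block = (
        f"Context: {ex.context}\n"
        f"Question: {ex.question}\n"
        f"Candidates: {cands}\n"
        f"Answer: {ex.answer}"
    )
    return block.rstrip()


def build_prompt(test: Example, shots: List[Example]) -> str:
    """Header, then the answer-aware examples, then the test input for the LLM to complete."""
    blocks = [PROMPT_HEADER] + [format_example(s) for s in shots] + [format_example(test)]
    return "\n\n".join(blocks)


if __name__ == "__main__":
    shot = Example(
        context="A man holds an umbrella in the rain.",
        question="What is he holding?",
        candidates=[("umbrella", 0.85), ("cane", 0.10)],
        answer="umbrella",
    )
    query = Example(
        context="A dog on a beach.",
        question="What animal is this?",
        candidates=[("dog", 0.97)],
    )
    print(build_prompt(query, [shot]))
```

The in-context examples carry their answers, while the test input ends at "Answer:" so that the language model supplies the completion. Showing candidates with confidence scores lets the model weigh the stage-one predictions rather than having to accept the top one.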

Quick Start & Requirements

  • Installation: Create a conda environment using conda env create -f environment.yml.
  • Prerequisites: Python >= 3.9, CUDA >= 11.3, PyTorch >= 1.12.0.
  • Hardware: Recommended: 1x RTX 3090 GPU, 50GB RAM, 300GB disk space (SSD recommended).
  • Data: Download MSCOCO 2014/2017 datasets and run bash scripts/extract_img_feats.sh.
  • Documentation: Usage details and scripts are available in the scripts directory.
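Before starting the data preparation, it can help to confirm that the machine meets the listed requirements. The sketch below uses only the Python standard library and checks the Python version and free disk space against the figures above. The function names and the MIN_PYTHON and MIN_FREE_DISK_GB constants are hypothetical, the check treats the 300GB figure as free space on the target drive (an assumption), and it does not cover the GPU, CUDA, or PyTorch requirements.

```python
import shutil
import sys

MIN_PYTHON = (3, 9)        # from the prerequisites listed above
MIN_FREE_DISK_GB = 300     # recommended disk space, read here as free space (assumption)


def python_ok(version_info=sys.version_info) -> bool:
    """True if the (major, minor) version is at least MIN_PYTHON."""
    return tuple(version_info[:2]) >= MIN_PYTHON


def free_disk_gb(path: str = ".") -> float:
    """Free space, in decimal gigabytes, on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / 1e9


def disk_ok(path: str = ".", required_gb: float = MIN_FREE_DISK_GB) -> bool:
    """True if the filesystem holding `path` has at least `required_gb` free."""
    return free_disk_gb(path) >= required_gb


if __name__ == "__main__":
    print(f"Python >= {MIN_PYTHON[0]}.{MIN_PYTHON[1]}: {python_ok()}")
    print(f"Free disk: {free_disk_gb():.0f} GB (need {MIN_FREE_DISK_GB}): {disk_ok()}")
```

Pass the directory where the datasets and extracted features will be stored as `path`, since that is the drive that needs the space.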

Highlighted Details

  • Achieves 61.1% accuracy on OK-VQA and 55.7% on A-OKVQA.
  • Utilizes an MCAN model for initial VQA and heuristic extraction.
  • Prompts GPT-3 with generated answer candidates and examples.
  • Provides pre-trained and fine-tuned models for OK-VQA and A-OKVQA.

Maintenance & Community

  • The project is associated with the CVPR 2023 paper "Prompting Large Language Models with Answer Heuristics for Knowledge-based Visual Question Answering."
  • The listed updates date from March and April 2023.

Licensing & Compatibility

  • Licensed under the Apache License 2.0.
  • Permissive license suitable for commercial use and integration with closed-source projects.

Limitations & Caveats

The framework requires significant computational resources and disk space for data preparation and model training. It also relies on access to the OpenAI GPT-3 API, which may incur costs.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 week
  • Pull requests (30d): 0
  • Issues (30d): 0

Star History

  • 5 stars in the last 90 days
