Cheetah by DCDmllm

Multimodal LLM for following zero-shot demonstrative instructions

created 2 years ago · 345 stars · Top 81.4% on sourcepulse

Project Summary

Cheetah is a multimodal large language model designed to follow zero-shot demonstrative instructions, particularly those involving interleaved vision-language contexts. It targets researchers and developers working with complex visual reasoning and instruction-following tasks, offering enhanced comprehension of visual narratives and metaphorical implications.

How It Works

Cheetah pairs a frozen LLM and a frozen vision encoder with a Visual Prompt Generator Complete (VPG-C) module, built on the Q-Former from BLIP-2. VPG-C uses intermediate LLM outputs to guide visual attention and capture visual details that would otherwise be missed, then merges those details back into the LLM's representation via a residual connection, enabling a comprehensive understanding of demonstrative instructions. This approach allows Cheetah to effectively process diverse interleaved vision-language inputs.
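The sketch below illustrates the residual-merge idea in isolation. It is not the project's implementation; the module names, dimensions, and attention layout are assumptions made for the example.

```python
# Conceptual sketch of the VPG-C idea described above (illustrative only):
# an intermediate LLM hidden state attends over image features to recover
# missing visual details, which are merged back via a residual connection.
import torch
import torch.nn as nn

class VPGCSketch(nn.Module):
    def __init__(self, llm_dim=4096, vis_dim=1408, num_heads=8):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, llm_dim)   # map vision features into LLM space
        self.cross_attn = nn.MultiheadAttention(llm_dim, num_heads, batch_first=True)
        self.out_proj = nn.Linear(llm_dim, llm_dim)

    def forward(self, llm_hidden, image_feats):
        # llm_hidden:  (B, T, llm_dim) intermediate hidden states from the frozen LLM
        # image_feats: (B, P, vis_dim) patch features from the frozen vision encoder
        vis = self.vis_proj(image_feats)
        # Intermediate LLM states act as queries over the image features,
        # recovering instruction-relevant visual details.
        recovered, _ = self.cross_attn(query=llm_hidden, key=vis, value=vis)
        # Residual merge: add the recovered details back to the original LLM stream.
        return llm_hidden + self.out_proj(recovered)
```

Because the LLM and vision encoder stay frozen, only a small module of this kind would carry trainable parameters.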

Quick Start & Requirements

  • Installation: Clone the repository, create a conda environment (conda create -n cheetah python=3.8), activate it (conda activate cheetah), and install requirements (pip install -r requirement.txt).
  • Prerequisites: Requires Vicuna-7B or LLaMA2-7B weights, and a pretrained Q-Former from BLIP-2 aligned with FlanT5-XXL. Paths to these weights must be configured in YAML files.
  • Usage: Examples are provided in test_cheetah_vicuna.py and test_cheetah_llama2.py. Run tests with python test_cheetah_vicuna.py --cfg-path eval_configs/cheetah_eval_vicuna.yaml --gpu-id 0 (a small launcher sketch follows this list).
  • Resources: Requires GPU(s) for inference. Setup involves downloading model weights and configuring paths.
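For convenience, the documented evaluation command can be wrapped in a small Python launcher. The script name, --cfg-path value, and --gpu-id flag below come straight from the usage example above; the wrapper itself is only an illustrative sketch, not code shipped with the repository.

```python
# Illustrative launcher for the documented evaluation command (run from the repo root).
# The script name and flags are taken from the usage example above; this wrapper is
# a convenience sketch, not part of Cheetah itself.
import subprocess
import sys

def run_cheetah_eval(cfg_path="eval_configs/cheetah_eval_vicuna.yaml", gpu_id=0):
    cmd = [
        sys.executable, "test_cheetah_vicuna.py",
        "--cfg-path", cfg_path,
        "--gpu-id", str(gpu_id),
    ]
    subprocess.run(cmd, check=True)  # raises CalledProcessError if evaluation fails

if __name__ == "__main__":
    run_cheetah_eval()
```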

Highlighted Details

  • Accepted as a Spotlight (top 5%) at ICLR 2024.
  • Introduces the DEMON benchmark for comprehensive demonstrative instruction following.
  • Demonstrates strong reasoning over complicated interleaved vision-language instructions, including identifying causal relationships and understanding metaphorical implications.
  • Built upon the LAVIS library.

Maintenance & Community

The project is associated with multiple academic institutions (Zhejiang University, National University of Singapore, Nanyang Technological University). No specific community channels or roadmap are mentioned in the README.

Licensing & Compatibility

  • License: BSD 3-Clause License.
  • Compatibility: Permissive license suitable for commercial use and integration with closed-source projects.

Limitations & Caveats

The current version requires manual preparation and configuration of specific LLM and Q-Former weights. A Gradio demo is planned for future release.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 2 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering, Designing Machine Learning Systems), Omar Sanseviero (DevRel at Google DeepMind), and 4 more.

open_flamingo by mlfoundations

Top 0.1% · 4k stars
Open-source framework for training large multimodal models
created 2 years ago · updated 11 months ago