Cheetah by DCDmllm

Multimodal LLM for following zero-shot demonstrative instructions

Created 2 years ago
342 stars

Top 80.8% on SourcePulse

View on GitHub
Project Summary

Cheetah is a multimodal large language model designed to follow zero-shot demonstrative instructions, particularly those involving interleaved vision-language contexts. It targets researchers and developers working with complex visual reasoning and instruction-following tasks, offering enhanced comprehension of visual narratives and metaphorical implications.

How It Works

Cheetah builds a Visual Prompt Generator Complete (VPG-C) module on top of a frozen LLM and a frozen vision encoder, with the pretrained Q-Former from BLIP-2 serving as the underlying visual prompt generator. VPG-C taps intermediate LLM outputs to guide attention back over the image features, recovering visual details that the initial visual prompts miss; the recovered details are then merged into those prompts through a residual connection. This completion step gives Cheetah a fuller picture of demonstrative instructions and lets it handle diverse interleaved vision-language inputs.
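
A minimal PyTorch sketch of this completion step is shown below, assuming an intermediate LLM hidden state is pooled into a single query that cross-attends over frozen vision-encoder features, with the recovered details added back to the Q-Former's visual prompts via a residual connection; module names, dimensions, and the pooling choice are illustrative assumptions, not the repository's actual implementation:

```python
import torch
import torch.nn as nn

class VPGCompletionSketch(nn.Module):
    """Illustrative sketch of the VPG-C idea: an intermediate LLM hidden state
    guides cross-attention over frozen image features, and the recovered
    details are merged into the original visual prompts via a residual add."""

    def __init__(self, d_llm: int = 512, d_vis: int = 256, n_heads: int = 4):
        super().__init__()
        self.query_proj = nn.Linear(d_llm, d_vis)    # LLM state -> visual query space
        self.cross_attn = nn.MultiheadAttention(d_vis, n_heads, batch_first=True)
        self.out_proj = nn.Linear(d_vis, d_llm)      # recovered details -> LLM space

    def forward(self, llm_hidden, image_feats, visual_prompts):
        # llm_hidden:      (B, T, d_llm)  intermediate layer output of the frozen LLM
        # image_feats:     (B, N, d_vis)  patch features from the frozen vision encoder
        # visual_prompts:  (B, Q, d_llm)  prompts produced by the frozen Q-Former (VPG)
        query = self.query_proj(llm_hidden.mean(dim=1, keepdim=True))  # (B, 1, d_vis)
        details, _ = self.cross_attn(query, image_feats, image_feats)  # re-attend to image
        return visual_prompts + self.out_proj(details)                 # residual merge

if __name__ == "__main__":
    vpg_c = VPGCompletionSketch()
    out = vpg_c(torch.randn(2, 16, 512), torch.randn(2, 64, 256), torch.randn(2, 32, 512))
    print(out.shape)  # torch.Size([2, 32, 512])
```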

Quick Start & Requirements

  • Installation: Clone the repository, create a conda environment (conda create -n cheetah python=3.8), activate it (conda activate cheetah), and install requirements (pip install -r requirement.txt).
  • Prerequisites: Requires Vicuna-7B or LLaMA2-7B weights, and a pretrained Q-Former from BLIP-2 aligned with FlanT5-XXL. Paths to these weights must be configured in the project's YAML files (see the path-check sketch after this list).
  • Usage: Examples are provided in test_cheetah_vicuna.py and test_cheetah_llama2.py. Run tests using python test_cheetah_vicuna.py --cfg-path eval_configs/cheetah_eval_vicuna.yaml --gpu-id 0.
  • Resources: Requires GPU(s) for inference. Setup involves downloading model weights and configuring paths.
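
The helper below is a hedged sketch, not part of the repository: it assumes PyYAML is installed, scans the eval YAML for checkpoint-style path strings, verifies they exist, and then launches the documented Vicuna test command. The suffix list and the CFG_PATH constant are illustrative assumptions.

```python
# Hedged helper sketch (not shipped with Cheetah): verify that checkpoint paths
# referenced in the eval config exist, then launch the documented test command.
import os
import subprocess

import yaml  # PyYAML: pip install pyyaml

CFG_PATH = "eval_configs/cheetah_eval_vicuna.yaml"

def iter_strings(node):
    """Yield every string value found anywhere in a nested dict/list config."""
    if isinstance(node, dict):
        for value in node.values():
            yield from iter_strings(value)
    elif isinstance(node, list):
        for value in node:
            yield from iter_strings(value)
    elif isinstance(node, str):
        yield node

with open(CFG_PATH) as f:
    cfg = yaml.safe_load(f)

# Flag values that look like local checkpoint files but do not exist yet.
# Note: LLM weights (Vicuna-7B / LLaMA2-7B) are usually a directory, so extend
# this check to match however your config references them.
checkpoint_like = [s for s in iter_strings(cfg)
                   if s.endswith((".pth", ".bin", ".ckpt", ".safetensors"))]
missing = [s for s in checkpoint_like if not os.path.exists(s)]
if missing:
    raise FileNotFoundError(f"Configure these paths in {CFG_PATH} first: {missing}")

# Run the Vicuna example exactly as shown in the Quick Start above.
subprocess.run(
    ["python", "test_cheetah_vicuna.py",
     "--cfg-path", CFG_PATH, "--gpu-id", "0"],
    check=True,
)
```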

Highlighted Details

  • Accepted as a Spotlight (top 5%) at ICLR 2024.
  • Introduces the DEMON benchmark for comprehensive demonstrative instruction following.
  • Demonstrates strong reasoning over complicated interleaved vision-language instructions, including identifying causal relationships and understanding metaphorical implications.
  • Built upon the LAVIS library.

Maintenance & Community

The project is associated with multiple academic institutions (Zhejiang University, National University of Singapore, Nanyang Technological University). No specific community channels or roadmap are mentioned in the README.

Licensing & Compatibility

  • License: BSD 3-Clause License.
  • Compatibility: Permissive license suitable for commercial use and integration with closed-source projects.

Limitations & Caveats

The current version requires manual preparation and configuration of specific LLM and Q-Former weights. A Gradio demo is planned for future release.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 0 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Elvis Saravia (Founder of DAIR.AI).

DeepSeek-VL2 by deepseek-ai
0.1% · 5k stars
MoE vision-language model for multimodal understanding
Created 9 months ago · Updated 6 months ago
Starred by Pawel Garbacki (Cofounder of Fireworks AI), Forrest Iandola (Author of SqueezeNet; Research Scientist at Meta), and 17 more.

MiniGPT-4 by Vision-CAIR
0.0% · 26k stars
Vision-language model for multi-task learning
Created 2 years ago · Updated 1 year ago