Cheetah by DCDmllm

Multimodal LLM for following zero-shot demonstrative instructions

Created 2 years ago
342 stars

Top 80.8% on SourcePulse

View on GitHub
Project Summary

Cheetah is a multimodal large language model designed to follow zero-shot demonstrative instructions, particularly those involving interleaved vision-language contexts. It targets researchers and developers working with complex visual reasoning and instruction-following tasks, offering enhanced comprehension of visual narratives and metaphorical implications.

How It Works

Cheetah builds a Visual Prompt Generator Complete (VPG-C) module on top of a frozen LLM and a frozen vision encoder, with the pretrained Q-Former from BLIP-2 serving as the underlying visual prompt generator. VPG-C taps intermediate LLM outputs to guide attention back over the image features, recovering visual details that the initial visual prompts miss; the recovered details are then merged into those prompts through a residual connection. This completion step gives Cheetah a fuller picture of demonstrative instructions and lets it handle diverse interleaved vision-language inputs.
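
A minimal PyTorch sketch of this completion step is shown below, assuming an intermediate LLM hidden state is pooled into a single query that cross-attends over frozen vision-encoder features, with the recovered details added back to the Q-Former's visual prompts via a residual connection; module names, dimensions, and the pooling choice are illustrative assumptions, not the repository's actual implementation:

```python
import torch
import torch.nn as nn

class VPGCompletionSketch(nn.Module):
    """Illustrative sketch of the VPG-C idea: an intermediate LLM hidden state
    guides cross-attention over frozen image features, and the recovered
    details are merged into the original visual prompts via a residual add."""

    def __init__(self, d_llm: int = 512, d_vis: int = 256, n_heads: int = 4):
        super().__init__()
        self.query_proj = nn.Linear(d_llm, d_vis)    # LLM state -> visual query space
        self.cross_attn = nn.MultiheadAttention(d_vis, n_heads, batch_first=True)
        self.out_proj = nn.Linear(d_vis, d_llm)      # recovered details -> LLM space

    def forward(self, llm_hidden, image_feats, visual_prompts):
        # llm_hidden:      (B, T, d_llm)  intermediate layer output of the frozen LLM
        # image_feats:     (B, N, d_vis)  patch features from the frozen vision encoder
        # visual_prompts:  (B, Q, d_llm)  prompts produced by the frozen Q-Former (VPG)
        query = self.query_proj(llm_hidden.mean(dim=1, keepdim=True))  # (B, 1, d_vis)
        details, _ = self.cross_attn(query, image_feats, image_feats)  # re-attend to image
        return visual_prompts + self.out_proj(details)                 # residual merge

if __name__ == "__main__":
    vpg_c = VPGCompletionSketch()
    out = vpg_c(torch.randn(2, 16, 512), torch.randn(2, 64, 256), torch.randn(2, 32, 512))
    print(out.shape)  # torch.Size([2, 32, 512])
```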

Quick Start & Requirements

  • Installation: Clone the repository, create a conda environment (conda create -n cheetah python=3.8), activate it (conda activate cheetah), and install requirements (pip install -r requirement.txt).
  • Prerequisites: Requires Vicuna-7B or LLaMA2-7B weights, and a pretrained Q-Former from BLIP-2 aligned with FlanT5-XXL. Paths to these weights must be configured in the project's YAML files (see the path-check sketch after this list).
  • Usage: Examples are provided in test_cheetah_vicuna.py and test_cheetah_llama2.py. Run tests using python test_cheetah_vicuna.py --cfg-path eval_configs/cheetah_eval_vicuna.yaml --gpu-id 0.
  • Resources: Requires GPU(s) for inference. Setup involves downloading model weights and configuring paths.
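
The helper below is a hedged sketch, not part of the repository: it assumes PyYAML is installed, scans the eval YAML for checkpoint-style path strings, verifies they exist, and then launches the documented Vicuna test command. The suffix list and the CFG_PATH constant are illustrative assumptions.

```python
# Hedged helper sketch (not shipped with Cheetah): verify that checkpoint paths
# referenced in the eval config exist, then launch the documented test command.
import os
import subprocess

import yaml  # PyYAML: pip install pyyaml

CFG_PATH = "eval_configs/cheetah_eval_vicuna.yaml"

def iter_strings(node):
    """Yield every string value found anywhere in a nested dict/list config."""
    if isinstance(node, dict):
        for value in node.values():
            yield from iter_strings(value)
    elif isinstance(node, list):
        for value in node:
            yield from iter_strings(value)
    elif isinstance(node, str):
        yield node

with open(CFG_PATH) as f:
    cfg = yaml.safe_load(f)

# Flag values that look like local checkpoint files but do not exist yet.
# Note: LLM weights (Vicuna-7B / LLaMA2-7B) are usually a directory, so extend
# this check to match however your config references them.
checkpoint_like = [s for s in iter_strings(cfg)
                   if s.endswith((".pth", ".bin", ".ckpt", ".safetensors"))]
missing = [s for s in checkpoint_like if not os.path.exists(s)]
if missing:
    raise FileNotFoundError(f"Configure these paths in {CFG_PATH} first: {missing}")

# Run the Vicuna example exactly as shown in the Quick Start above.
subprocess.run(
    ["python", "test_cheetah_vicuna.py",
     "--cfg-path", CFG_PATH, "--gpu-id", "0"],
    check=True,
)
```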

Highlighted Details

  • Accepted as a Spotlight (top 5%) at ICLR 2024.
  • Introduces the DEMON benchmark for comprehensive demonstrative instruction following.
  • Demonstrates strong reasoning over complicated interleaved vision-language instructions, including identifying causal relationships and understanding metaphorical implications.
  • Built upon the LAVIS library.

Maintenance & Community

The project is associated with multiple academic institutions (Zhejiang University, National University of Singapore, Nanyang Technological University). No specific community channels or roadmap are mentioned in the README.

Licensing & Compatibility

  • License: BSD 3-Clause License.
  • Compatibility: Permissive license suitable for commercial use and integration with closed-source projects.

Limitations & Caveats

The current version requires manual preparation and configuration of specific LLM and Q-Former weights. A Gradio demo is planned for future release.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 0 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Elvis Saravia (Founder of DAIR.AI).

DeepSeek-VL2 by deepseek-ai
0.1% · 5k stars
MoE vision-language model for multimodal understanding
Created 9 months ago · Updated 6 months ago
Starred by Pawel Garbacki (Cofounder of Fireworks AI), Forrest Iandola (Author of SqueezeNet; Research Scientist at Meta), and 17 more.

MiniGPT-4 by Vision-CAIR
0.0% · 26k stars
Vision-language model for multi-task learning
Created 2 years ago · Updated 1 year ago