Multimodal LLM for following zero-shot demonstrative instructions
Cheetah is a multimodal large language model designed to follow zero-shot demonstrative instructions, particularly those involving interleaved vision-language contexts. It targets researchers and developers working with complex visual reasoning and instruction-following tasks, offering enhanced comprehension of visual narratives and metaphorical implications.
How It Works
Cheetah builds a Visual Prompt Generator Complete (VPG-C) on top of a frozen LLM and a frozen vision encoder, with the Q-Former from BLIP-2 serving as the base visual prompt generator. VPG-C taps an intermediate layer of the LLM and uses that hidden state to guide visual attention back over the image features, recovering visual details that the original prompts missed. The recovered details are merged into the visual prompts via a residual connection, giving the model a complete view of the demonstrative instruction. This lets Cheetah process diverse interleaved vision-language inputs effectively.
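To make the residual mechanism concrete, the sketch below shows one way the idea could be wired up in PyTorch. It is a minimal illustration under stated assumptions, not the official implementation: the class name VPGCSketch, the dimensions, the mean-pooling of the LLM hidden state, and the single cross-attention layer are all placeholders chosen for readability.

```python
# Minimal sketch of the VPG-C idea described above (not the official code):
# an intermediate hidden state from the frozen LLM acts as a query that
# re-attends to the raw image features, and the recovered detail is merged
# back into the visual prompts via a residual connection.
import torch
import torch.nn as nn


class VPGCSketch(nn.Module):
    def __init__(self, llm_dim: int = 4096, vis_dim: int = 1024, num_heads: int = 8):
        super().__init__()
        # Project the intermediate LLM hidden state into the visual feature
        # space so it can query patch-level image features.
        self.query_proj = nn.Linear(llm_dim, vis_dim)
        self.cross_attn = nn.MultiheadAttention(vis_dim, num_heads, batch_first=True)
        # Map the recovered visual detail back into the LLM embedding space.
        self.out_proj = nn.Linear(vis_dim, llm_dim)

    def forward(
        self,
        visual_prompts: torch.Tensor,   # (B, Q, llm_dim) prompts from the frozen Q-Former
        llm_hidden: torch.Tensor,       # (B, T, llm_dim) intermediate LLM layer output
        image_features: torch.Tensor,   # (B, P, vis_dim) frozen vision-encoder patch features
    ) -> torch.Tensor:
        # Use the instruction-conditioned LLM state to decide which image
        # regions still need attention (mean-pooled here for simplicity).
        query = self.query_proj(llm_hidden.mean(dim=1, keepdim=True))  # (B, 1, vis_dim)
        detail, _ = self.cross_attn(query, image_features, image_features)
        # Residual merge: add the recovered detail onto the original prompts.
        return visual_prompts + self.out_proj(detail)
```

Because the LLM and vision encoder stay frozen, only these lightweight projection and attention layers would need training, which is the efficiency argument behind the VPG-C design.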
Quick Start & Requirements
Create a conda environment (conda create -n cheetah python=3.8), activate it (conda activate cheetah), and install the dependencies (pip install -r requirement.txt). Two test scripts are provided, test_cheetah_vicuna.py and test_cheetah_llama2.py. Run them with, for example, python test_cheetah_vicuna.py --cfg-path eval_configs/cheetah_eval_vicuna.yaml --gpu-id 0.
Highlighted Details
Maintenance & Community
The project is associated with multiple academic institutions (Zhejiang University, National University of Singapore, Nanyang Technological University). No specific community channels or roadmap are mentioned in the README.
Licensing & Compatibility
Limitations & Caveats
The current version requires manual preparation and configuration of specific LLM and Q-Former weights. A Gradio demo is planned for future release.