DCDmllm / Cheetah: Multimodal LLM for following zero-shot demonstrative instructions
Cheetah is a multimodal large language model designed to follow zero-shot demonstrative instructions, particularly those involving interleaved vision-language contexts. It targets researchers and developers working with complex visual reasoning and instruction-following tasks, offering enhanced comprehension of visual narratives and metaphorical implications.
How It Works
Cheetah utilizes a Visual Prompt Generator Complete (VPG-C) built upon a frozen LLM and vision encoder, specifically the Q-Former from BLIP-2. VPG-C leverages intermediate LLM outputs to guide visual attention, capturing missing visual details from images. These details are then merged back via a residual connection, enabling a comprehensive understanding of demonstrative instructions. This approach allows Cheetah to effectively process diverse interleaved vision-language inputs.
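As a rough illustration of that mechanism, the sketch below conditions a set of learnable queries on an intermediate LLM hidden state, attends over frozen vision features to recover missing detail, and merges the result back into the visual prompt through a residual connection. This is a minimal PyTorch sketch under assumed shapes; the module and parameter names (VPGCSketch, llm_dim, vis_dim, num_queries) are hypothetical and do not reflect Cheetah's actual code.

```python
# Minimal sketch of the VPG-C idea described above (hypothetical names, not
# Cheetah's actual implementation): an intermediate LLM hidden state steers a
# cross-attention query over frozen vision features, and the recovered detail
# is merged back into the visual prompt via a residual connection.
import torch
import torch.nn as nn


class VPGCSketch(nn.Module):
    def __init__(self, llm_dim: int, vis_dim: int, num_queries: int = 32):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, vis_dim))
        self.condition = nn.Linear(llm_dim, vis_dim)   # project LLM state into the visual query space
        self.cross_attn = nn.MultiheadAttention(vis_dim, num_heads=8, batch_first=True)
        self.proj = nn.Linear(vis_dim, llm_dim)        # map attended detail back to the LLM embedding space

    def forward(self, llm_hidden, vision_feats, base_prompt):
        # llm_hidden:   (B, llm_dim)      intermediate-layer LLM representation of the instruction
        # vision_feats: (B, N, vis_dim)   frozen vision-encoder features for the image
        # base_prompt:  (B, Q, llm_dim)   visual prompt from the frozen VPG (Q == num_queries)
        bsz = llm_hidden.size(0)
        queries = self.queries.unsqueeze(0).expand(bsz, -1, -1)
        queries = queries + self.condition(llm_hidden).unsqueeze(1)       # instruction-guided queries
        detail, _ = self.cross_attn(queries, vision_feats, vision_feats)  # attend to missing visual details
        return base_prompt + self.proj(detail)                            # residual merge into the prompt


# Toy dimensions, for illustration only:
vpg_c = VPGCSketch(llm_dim=4096, vis_dim=768, num_queries=32)
out = vpg_c(torch.randn(2, 4096), torch.randn(2, 257, 768), torch.randn(2, 32, 4096))
print(out.shape)  # torch.Size([2, 32, 4096])
```

The residual merge means the frozen visual prompt is only adjusted, not replaced, which keeps the pretrained vision-language alignment intact while adding instruction-specific detail.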
Quick Start & Requirements
Create a conda environment (conda create -n cheetah python=3.8), activate it (conda activate cheetah), and install the requirements (pip install -r requirement.txt). Evaluation scripts are provided for both backbones: test_cheetah_vicuna.py and test_cheetah_llama2.py. Run them with, for example, python test_cheetah_vicuna.py --cfg-path eval_configs/cheetah_eval_vicuna.yaml --gpu-id 0.
Highlighted Details
Maintenance & Community
The project is associated with multiple academic institutions (Zhejiang University, National University of Singapore, Nanyang Technological University). No specific community channels or roadmap are mentioned in the README. The repository's last recorded activity was about a year ago, and it is currently marked inactive.
Licensing & Compatibility
Limitations & Caveats
The current version requires manual preparation and configuration of specific LLM and Q-Former weights. A Gradio demo is planned for future release.