Griffon (jefferyZhan): Large multimodal models for advanced visual reasoning and perception
Summary
Griffon is a series of large multimodal models (LMMs) for advanced visual reasoning, perception, and localization. It targets researchers and engineers seeking state-of-the-art performance in visual grounding, referring expression comprehension (REC), and general question answering, enabling models to understand, think, and answer based on visual input.
How It Works
Griffon employs large multimodal models for fine-grained perception and reasoning. Key iterations include Griffon v1 (ECCV 2024) for detailed object localization, Griffon v2 (ICCV 2025) with high-resolution scaling and co-referring, and Griffon-G bridging vision-language and vision-centric tasks. Vision-R1 further advances alignment via vision-guided reinforcement learning.
Quick Start & Requirements
Installation: clone the repository and run pip install -e . in its root. Pre-trained Griffon-G (9B, 27B) and CLIP models must be downloaded into a checkpoints folder. Inference requires CUDA plus the model and image paths. Evaluation uses LLaVA Evaluation and VLMEvalKit for multimodal benchmarks, with dedicated scripts for COCO detection and REC tasks; the corresponding datasets must be obtained separately.
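The setup described above can be sketched as the following commands. The GitHub URL and the CLIP model ID are real; the Griffon-G Hugging Face repo ID and the inference script name are placeholders assumed for illustration and should be checked against the project's README:

```shell
# Clone the Griffon repository and install it in editable mode
git clone https://github.com/jefferyZhan/Griffon.git
cd Griffon
pip install -e .

# Download pre-trained weights into a local checkpoints/ folder.
# NOTE: the Griffon-G repo ID below is an assumed placeholder.
mkdir -p checkpoints
huggingface-cli download JefferyZhan/Griffon-G-9B --local-dir checkpoints/Griffon-G-9B
huggingface-cli download openai/clip-vit-large-patch14 \
    --local-dir checkpoints/clip-vit-large-patch14

# Run inference on a single image (hypothetical script and flag names;
# a CUDA-capable GPU is required).
python inference.py \
    --model-path checkpoints/Griffon-G-9B \
    --image path/to/image.jpg
```

The larger 27B checkpoint follows the same pattern with a different repo ID; for benchmark evaluation, the repository's scripts wrap LLaVA Evaluation and VLMEvalKit rather than shipping their own harness.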
Maintenance & Community
The project is actively maintained with ongoing releases and updates. The developers encourage community contributions via pull requests. The acknowledgements credit the LLaVA and LLaMA projects, on which Griffon builds.
Licensing & Compatibility
Data and checkpoints are strictly for research use only, adhering to LLaVA, LLaMA, Gemma2, and GPT-4 licenses. The dataset is CC BY-NC 4.0 (non-commercial use only). Models trained on this dataset are also restricted to research purposes, precluding commercial applications.
Limitations & Caveats
The primary limitation is the non-commercial restriction on all data and model checkpoints, which rules out commercial deployment. Training code for Griffon-G was announced as forthcoming at the time of writing. Multimodal benchmark evaluation relies on external toolkits.