Griffon by jefferyZhan

Large multimodal models for advanced visual reasoning and perception

Created 2 years ago
250 stars

Top 100.0% on SourcePulse

Project Summary

Griffon is a series of large multimodal models (LMMs) for advanced visual reasoning, perception, and localization. It targets researchers and engineers seeking state-of-the-art performance in visual grounding, referring expression comprehension (REC), and general question answering, enabling models to understand, reason, and answer based on visual input.

How It Works

Griffon employs large multimodal models for fine-grained perception and reasoning. Key iterations include Griffon v1 (ECCV 2024) for detailed object localization, Griffon v2 (ICCV 2025) with high-resolution scaling and co-referring, and Griffon-G bridging vision-language and vision-centric tasks. Vision-R1 further advances alignment via vision-guided reinforcement learning.

Quick Start & Requirements

Installation: clone the repository and run `pip install -e .`. Pre-trained Griffon-G (9B, 27B) and CLIP models must be downloaded into a `checkpoints` folder. Inference requires CUDA, model paths, and image locations. Evaluation uses LLaVA Evaluation/VLMEvalKit for multimodal benchmarks, with dedicated scripts for COCO detection and REC tasks that require the corresponding datasets.
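The setup described above can be sketched as the following commands. The repository URL is inferred from the author name, and the checkpoint directory layout is an assumption; verify both against the project's README before use.

```shell
# Clone the repo and install in editable mode (URL assumed from author name)
git clone https://github.com/jefferyZhan/Griffon.git
cd Griffon
pip install -e .

# Create the checkpoints folder and place the downloaded weights there.
# Exact filenames below are placeholders -- check the README for the
# actual Griffon-G (9B or 27B) and CLIP download links.
mkdir -p checkpoints
# checkpoints/Griffon-G-9B/   <- pre-trained Griffon-G weights
# checkpoints/clip-vit/       <- CLIP vision encoder
```

Inference and evaluation scripts then take the model path, the CLIP path, and image locations as arguments, and require a CUDA-capable GPU.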

Highlighted Details

  • Achieves state-of-the-art results in visual grounding, REC, and object detection.
  • Supports fine-grained object localization at any granularity and general visual question answering.
  • Griffon v2 introduces high-resolution scaling and visual-language co-referring.
  • Vision-R1 framework offers novel human-free alignment via reinforcement learning.

Maintenance & Community

The project is actively maintained with ongoing releases and updates. Developers encourage community contributions via pull requests. Acknowledgements suggest integration with LLaVA and Llama projects.

Licensing & Compatibility

Data and checkpoints are strictly for research use only, adhering to LLaVA, LLaMA, Gemma2, and GPT-4 licenses. The dataset is CC BY-NC 4.0 (non-commercial use only). Models trained on this dataset are also restricted to research purposes, precluding commercial applications.

Limitations & Caveats

The primary limitation is the non-commercial restriction on all data and model checkpoints, making it unsuitable for commercial deployment. Training codes for Griffon-G were announced as forthcoming. Multimodal benchmark evaluation relies on external toolkits.

Health Check
Last Commit

8 months ago

Responsiveness

Inactive

Pull Requests (30d)
0
Issues (30d)
0
Star History
1 star in the last 30 days

