Osprey by CircleRadon

Code and models for pixel-level image understanding via visual instruction tuning

Created 1 year ago
830 stars

Top 42.8% on SourcePulse

Project Summary

Osprey is a multimodal large language model (MLLM) designed for fine-grained pixel-level image understanding. It enables MLLMs to generate semantic descriptions of specific image regions by incorporating mask-text pairs into visual instruction tuning, benefiting researchers and developers working with detailed visual analysis and image captioning.

How It Works

Osprey extends existing MLLMs by integrating pixel-wise mask regions into language instructions. This approach allows the model to focus on specific objects or parts of an image, generating both short and detailed semantic descriptions. It leverages the Segment Anything Model (SAM) for mask generation, supporting point, box, and "segment everything" prompts for versatile input.
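
As a rough illustration of the mask-prompting half of that pipeline, the sketch below generates a region mask with the official segment_anything package from a single point prompt. The checkpoint path, image file, and point coordinates are placeholders, and the final hand-off of the mask to Osprey is only indicated in a comment rather than reproduced from the repo's demo code.

    import numpy as np
    import cv2
    from segment_anything import sam_model_registry, SamPredictor

    # Load SAM ViT-B; the checkpoint path is a placeholder for the file you downloaded.
    sam = sam_model_registry["vit_b"](checkpoint="checkpoints/sam_vit_b_01ec64.pth")
    predictor = SamPredictor(sam)

    # Set the image once; SAM caches its embedding for repeated prompts.
    image = cv2.cvtColor(cv2.imread("demo.jpg"), cv2.COLOR_BGR2RGB)
    predictor.set_image(image)

    # A single positive point prompt (x, y); a box prompt would be passed via `box=` instead.
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[320, 240]]),
        point_labels=np.array([1]),  # 1 marks a foreground point
        multimask_output=True,
    )
    region_mask = masks[scores.argmax()]  # boolean HxW mask for the chosen region

    # This binary mask, paired with a region reference in the text prompt, is the kind of
    # mask-text input Osprey consumes; see the repo's demo scripts for the exact call.

The "segment everything" mode mentioned above corresponds to SAM's SamAutomaticMaskGenerator, which produces masks for all detected regions in a single pass.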

Quick Start & Requirements

  • Install: pip install -e . (after cloning the repo). Additional packages for training: pip install -e ".[train]", pip install flash-attn --no-build-isolation.
  • Prerequisites: Python 3.10, PyTorch, Hugging Face Transformers, Gradio, SAM. For offline demo: ~17GB GPU memory (15GB for Osprey, 2GB for SAM).
  • Checkpoints: Requires downloading the Osprey-7b, CLIP-convnext, and SAM ViT-B weights (a download sketch follows this list).
  • Demo: Online demo available at http://111.0.123.204:8000/ (username: osprey, password: osprey).
  • Dataset: Osprey-724K dataset available on Hugging Face.
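
A minimal download sketch using huggingface_hub is shown below. The repository IDs are placeholders to be replaced with the links given in the repo's README; the SAM ViT-B checkpoint is distributed separately via the segment-anything release links.

    from huggingface_hub import snapshot_download

    # Placeholder repo ids: substitute the Osprey-7b and CLIP-convnext model ids
    # linked in the README before running.
    snapshot_download(repo_id="<osprey-7b-model-id>", local_dir="checkpoints/osprey-7b")
    snapshot_download(repo_id="<clip-convnext-model-id>", local_dir="checkpoints/clip-convnext")

    # Osprey-724K instruction-tuning data (dataset repo id is likewise a placeholder).
    snapshot_download(
        repo_id="<osprey-724k-dataset-id>",
        repo_type="dataset",
        local_dir="data/osprey-724k",
    )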

Highlighted Details

  • CVPR2024 accepted paper.
  • Integrates with SAM for mask-based visual understanding.
  • Supports object-level, part-level, and general instruction samples.
  • Released Osprey-Chat model with improved conversational and reasoning capabilities.

Maintenance & Community

  • Associated with the CVPR2025 accepted follow-up work "VideoRefer Suite".
  • Metrics defined by Osprey have been adopted in other research projects (ChatRex, Describe Anything Model).
  • Codebase built upon LLaVA-v1.5.

Licensing & Compatibility

  • The repository does not explicitly state a license. The underlying LLaVA codebase is Apache 2.0. Model weights are typically released under specific licenses (e.g., Llama 2 license for Vicuna). Users should verify compatibility for commercial use.

Limitations & Caveats

The project does not explicitly state limitations or caveats in the README. The training process involves multiple stages, and users must download several large checkpoints.

Health Check

  • Last Commit: 1 month ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 2 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems") and Elvis Saravia (founder of DAIR.AI).

DeepSeek-VL2 by deepseek-ai

  • MoE vision-language model for multimodal understanding
  • Top 0.1% on SourcePulse, 5k stars
  • Created 9 months ago, updated 6 months ago