Osprey by CircleRadon

Research code for pixel-level understanding via visual instruction tuning

created 1 year ago
826 stars

Top 43.9% on sourcepulse

Project Summary

Osprey is a multimodal large language model (MLLM) designed for fine-grained pixel-level image understanding. It enables MLLMs to generate semantic descriptions of specific image regions by incorporating mask-text pairs into visual instruction tuning, benefiting researchers and developers working with detailed visual analysis and image captioning.

How It Works

Osprey extends existing MLLMs by integrating pixel-wise mask regions into language instructions. This approach allows the model to focus on specific objects or parts of an image, generating both short and detailed semantic descriptions. It leverages the Segment Anything Model (SAM) for mask generation, supporting point, box, and "segment everything" prompts for versatile input.
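
Below is a minimal sketch of the mask-prompting flow this builds on. The SAM calls use the official segment_anything package and a local ViT-B checkpoint; the final describe_region call is a hypothetical placeholder for Osprey's mask-conditioned description step, not the repo's actual API.

    # Minimal sketch: produce a region mask with SAM, then (hypothetically) pass it
    # to an Osprey-style model together with a text instruction.
    import numpy as np
    from PIL import Image
    from segment_anything import sam_model_registry, SamPredictor

    # Load a SAM ViT-B checkpoint (path is an assumption; point it at your download).
    sam = sam_model_registry["vit_b"](checkpoint="checkpoints/sam_vit_b_01ec64.pth")
    predictor = SamPredictor(sam)

    image = np.array(Image.open("example.jpg").convert("RGB"))
    predictor.set_image(image)

    # Point prompt: a single foreground click on the object of interest.
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[320, 240]]),  # (x, y) pixel coordinates
        point_labels=np.array([1]),           # 1 = foreground
        multimask_output=False,
    )
    region_mask = masks[0]  # boolean H x W mask for the selected region

    # Hypothetical Osprey-side call: pair the mask with a text instruction so the
    # MLLM describes only that region (not the repo's real function name).
    # description = osprey.describe_region(image, region_mask,
    #                                      "Describe this region in detail.")

A box prompt (the predictor's box argument) or SAM's automatic "segment everything" mode can stand in for the point prompt in the same way.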

Quick Start & Requirements

  • Install: pip install -e . (after cloning the repo). Additional packages for training: pip install -e ".[train]", pip install flash-attn --no-build-isolation.
  • Prerequisites: Python 3.10, PyTorch, Hugging Face Transformers, Gradio, SAM. For offline demo: ~17GB GPU memory (15GB for Osprey, 2GB for SAM).
  • Checkpoints: Requires downloading the Osprey-7b, CLIP-convnext, and SAM ViT-B models (a download sketch follows this list).
  • Demo: Online demo available at http://111.0.123.204:8000/ (username: osprey, password: osprey).
  • Dataset: Osprey-724K dataset available on Hugging Face.
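
A hedged sketch of fetching the checkpoints, assuming huggingface_hub is installed: the two repo ids are placeholders to be replaced with the official links from the README, while the SAM URL is Meta's public ViT-B checkpoint.

    # Sketch only: substitute the placeholder repo ids with the links given in the
    # Osprey README before running.
    import os
    import urllib.request
    from huggingface_hub import snapshot_download

    os.makedirs("checkpoints", exist_ok=True)
    snapshot_download(repo_id="<osprey-7b-repo-id>", local_dir="checkpoints/osprey-7b")
    snapshot_download(repo_id="<clip-convnext-repo-id>", local_dir="checkpoints/clip-convnext")
    urllib.request.urlretrieve(
        "https://dl.fbaipublicfiles.com/segment_anything/sam_vit_b_01ec64.pth",
        "checkpoints/sam_vit_b_01ec64.pth",
    )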

Highlighted Details

  • Paper accepted at CVPR 2024.
  • Integrates with SAM for mask-based visual understanding.
  • Supports object-level, part-level, and general instruction samples.
  • Released Osprey-Chat model with improved conversational and reasoning capabilities.

Maintenance & Community

  • Associated with the CVPR 2025 accepted work "VideoRefer Suite".
  • Metrics defined by Osprey have been adopted in other research projects (ChatRex, Describe Anything Model).
  • Codebase built upon LLaVA-v1.5.

Licensing & Compatibility

  • The repository does not explicitly state a license. The underlying LLaVA codebase is Apache 2.0. Model weights are typically released under specific licenses (e.g., Llama 2 license for Vicuna). Users should verify compatibility for commercial use.

Limitations & Caveats

The project does not explicitly state limitations or caveats in the README. The training process involves multiple stages, and users must download several large checkpoints.

Health Check

  • Last commit: 3 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 13 stars in the last 90 days
