Osprey by CircleRadon

Code and models for pixel-level image understanding via visual instruction tuning

Created 1 year ago
830 stars

Top 42.8% on SourcePulse

Project Summary

Osprey is a multimodal large language model (MLLM) designed for fine-grained pixel-level image understanding. It enables MLLMs to generate semantic descriptions of specific image regions by incorporating mask-text pairs into visual instruction tuning, benefiting researchers and developers working with detailed visual analysis and image captioning.

How It Works

Osprey extends existing MLLMs by integrating pixel-wise mask regions into language instructions. This approach allows the model to focus on specific objects or parts of an image, generating both short and detailed semantic descriptions. It leverages the Segment Anything Model (SAM) for mask generation, supporting point, box, and "segment everything" prompts for versatile input.
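
As a rough illustration of the mask-prompting half of that pipeline, the sketch below generates a region mask with the official segment_anything package from a single point prompt. The checkpoint path, image file, and point coordinates are placeholders, and the final hand-off of the mask to Osprey is only indicated in a comment rather than reproduced from the repo's demo code.

    import numpy as np
    import cv2
    from segment_anything import sam_model_registry, SamPredictor

    # Load SAM ViT-B; the checkpoint path is a placeholder for the file you downloaded.
    sam = sam_model_registry["vit_b"](checkpoint="checkpoints/sam_vit_b_01ec64.pth")
    predictor = SamPredictor(sam)

    # Set the image once; SAM caches its embedding for repeated prompts.
    image = cv2.cvtColor(cv2.imread("demo.jpg"), cv2.COLOR_BGR2RGB)
    predictor.set_image(image)

    # A single positive point prompt (x, y); a box prompt would be passed via `box=` instead.
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[320, 240]]),
        point_labels=np.array([1]),  # 1 marks a foreground point
        multimask_output=True,
    )
    region_mask = masks[scores.argmax()]  # boolean HxW mask for the chosen region

    # This binary mask, paired with a region reference in the text prompt, is the kind of
    # mask-text input Osprey consumes; see the repo's demo scripts for the exact call.

The "segment everything" mode mentioned above corresponds to SAM's SamAutomaticMaskGenerator, which produces masks for all detected regions in a single pass.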

Quick Start & Requirements

  • Install: pip install -e . (after cloning the repo). Additional packages for training: pip install -e ".[train]", pip install flash-attn --no-build-isolation.
  • Prerequisites: Python 3.10, PyTorch, Hugging Face Transformers, Gradio, SAM. For offline demo: ~17GB GPU memory (15GB for Osprey, 2GB for SAM).
  • Checkpoints: Requires downloading the Osprey-7b, CLIP-convnext, and SAM ViT-B weights (a download sketch follows this list).
  • Demo: Online demo available at http://111.0.123.204:8000/ (username: osprey, password: osprey).
  • Dataset: Osprey-724K dataset available on Hugging Face.
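
A minimal download sketch using huggingface_hub is shown below. The repository IDs are placeholders to be replaced with the links given in the repo's README; the SAM ViT-B checkpoint is distributed separately via the segment-anything release links.

    from huggingface_hub import snapshot_download

    # Placeholder repo ids: substitute the Osprey-7b and CLIP-convnext model ids
    # linked in the README before running.
    snapshot_download(repo_id="<osprey-7b-model-id>", local_dir="checkpoints/osprey-7b")
    snapshot_download(repo_id="<clip-convnext-model-id>", local_dir="checkpoints/clip-convnext")

    # Osprey-724K instruction-tuning data (dataset repo id is likewise a placeholder).
    snapshot_download(
        repo_id="<osprey-724k-dataset-id>",
        repo_type="dataset",
        local_dir="data/osprey-724k",
    )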

Highlighted Details

  • CVPR2024 accepted paper.
  • Integrates with SAM for mask-based visual understanding.
  • Supports object-level, part-level, and general instruction samples.
  • Released Osprey-Chat model with improved conversational and reasoning capabilities.

Maintenance & Community

  • Associated with the CVPR2025 accepted follow-up work "VideoRefer Suite".
  • Metrics defined by Osprey have been adopted in other research projects (ChatRex, Describe Anything Model).
  • Codebase built upon LLaVA-v1.5.

Licensing & Compatibility

  • The repository does not explicitly state a license. The underlying LLaVA codebase is Apache 2.0. Model weights are typically released under specific licenses (e.g., Llama 2 license for Vicuna). Users should verify compatibility for commercial use.

Limitations & Caveats

The project does not explicitly state limitations or caveats in the README. The training process involves multiple stages, and users must download several large checkpoints.

Health Check

  • Last Commit: 1 month ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 2 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems") and Elvis Saravia (founder of DAIR.AI).

DeepSeek-VL2 by deepseek-ai

  • MoE vision-language model for multimodal understanding
  • Top 0.1% on SourcePulse, 5k stars
  • Created 9 months ago, updated 6 months ago