describe-anything  by NVlabs

Image/video captioning model for detailed localized descriptions

created 4 months ago
1,293 stars

Top 31.5% on sourcepulse

GitHubView on GitHub
Project Summary

This repository provides the "Describe Anything" model (DAM), a system for generating detailed, localized descriptions of regions within images and videos. It's designed for researchers and developers working on advanced computer vision and natural language processing tasks, offering precise captioning for user-specified areas.

How It Works

DAM leverages a foundation model that takes image or video regions, defined by points, boxes, or masks, and outputs detailed textual descriptions. For videos, annotations on a single frame are sufficient, with the model handling temporal propagation. This approach allows for highly specific and context-aware captioning, going beyond global scene descriptions.

Quick Start & Requirements

  • Installation: pip install git+https://github.com/NVlabs/describe-anything
  • Dependencies: Python, Gradio (for demos), SAM/SAM2 (for segmentation). Specific model weights are hosted on HuggingFace.
  • Demos: Includes interactive Gradio demos for image and video captioning, with optional integration of SAM for automated mask generation.
  • Resources: A self-contained script is available for image descriptions without full package installation.
  • API: An OpenAI-compatible API server (dam_server.py) is provided for integration.
  • Links: Paper, Project Page, HuggingFace Demo, Model/Benchmark/Datasets

Highlighted Details

  • Supports detailed localized captioning for both images and videos.
  • Introduces DLC-Bench, a new benchmark for evaluating Detailed Localized Captioning models.
  • Offers an OpenAI-compatible API for seamless integration into existing workflows.
  • Includes example scripts for command-line usage, interactive demos, and API interaction.

Maintenance & Community

Developed by NVlabs, UC Berkeley, and UCSF. Links to HuggingFace for models and datasets are provided.

Licensing & Compatibility

  • Code: Apache License 2.0
  • Model Weights & Data: NVIDIA Noncommercial License
  • DLC-Bench: CC BY-NC-SA 4.0
  • Commercial Use: The non-commercial license for model weights restricts commercial applications.

Limitations & Caveats

Model weights are released under a non-commercial license, limiting their use in commercial products. The project is associated with a 2025 arXiv preprint, suggesting it may be a recent research release.

Health Check
Last commit

1 month ago

Responsiveness

Inactive

Pull Requests (30d)
0
Issues (30d)
3
Star History
388 stars in the last 90 days

Explore Similar Projects

Feedback? Help us improve.