describe-anything by NVlabs

Image/video captioning model for detailed localized descriptions

Created 5 months ago
1,334 stars

Top 30.1% on SourcePulse

Project Summary

This repository provides the "Describe Anything" model (DAM), a system for generating detailed, localized descriptions of regions within images and videos. It's designed for researchers and developers working on advanced computer vision and natural language processing tasks, offering precise captioning for user-specified areas.

How It Works

DAM leverages a foundation model that takes image or video regions, specified by points, boxes, or masks, and outputs detailed textual descriptions. For videos, annotating a single frame is sufficient; the model handles temporal propagation across the remaining frames. This enables highly specific, context-aware captioning that goes beyond global scene descriptions.
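The region-to-description contract can be pictured as follows. This is a minimal, hypothetical sketch: the model class and method names are illustrative assumptions, not the repository's exact API; only the region formats (points, boxes, masks) come from the project description.

    # Hypothetical sketch of DAM's region-in, description-out contract.
    # DescribeAnythingModel and describe() are placeholder names, not
    # the repository's actual API.
    import numpy as np
    from PIL import Image

    image = Image.open("street.jpg").convert("RGB")

    # A region can arrive as points, a box, or a binary mask; boxes and
    # points are easily lifted into the mask representation.
    box = (120, 80, 340, 300)  # (x1, y1, x2, y2) in pixel coordinates
    mask = np.zeros((image.height, image.width), dtype=bool)
    mask[box[1]:box[3], box[0]:box[2]] = True  # fill the box region

    # Hypothetical model call (names are placeholders):
    # model = DescribeAnythingModel.from_pretrained("nvidia/DAM-3B")
    # print(model.describe(image, mask=mask))  # caption of just that region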

Quick Start & Requirements

  • Installation: pip install git+https://github.com/NVlabs/describe-anything
  • Dependencies: Python, Gradio (for demos), SAM/SAM2 (for segmentation). Pretrained model weights are hosted on HuggingFace.
  • Demos: Includes interactive Gradio demos for image and video captioning, with optional integration of SAM for automated mask generation.
  • Resources: A self-contained script is available for image descriptions without full package installation.
  • API: An OpenAI-compatible API server (dam_server.py) is provided for integration; see the sketch after this list.
  • Links: Paper, Project Page, HuggingFace Demo, Model/Benchmark/Datasets
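Because the server speaks the OpenAI protocol, a standard OpenAI client can talk to it. The sketch below assumes a local server on port 8000, a placeholder model id, and a plain image-plus-prompt message; the exact port, model name, and message format should be taken from the repository's documentation.

    # Hedged sketch: querying dam_server.py through the standard OpenAI
    # client. The base_url, model id, and message format are assumptions;
    # only "OpenAI-compatible" comes from the project description.
    import base64
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

    with open("street.jpg", "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()

    response = client.chat.completions.create(
        model="describe-anything",  # placeholder model id
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Describe the highlighted region in detail."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    print(response.choices[0].message.content)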

Highlighted Details

  • Supports detailed localized captioning for both images and videos.
  • Introduces DLC-Bench, a new benchmark for evaluating Detailed Localized Captioning models.
  • Offers an OpenAI-compatible API for seamless integration into existing workflows.
  • Includes example scripts for command-line usage, interactive demos, and API interaction.

Maintenance & Community

Developed by NVlabs, UC Berkeley, and UCSF. Links to HuggingFace for models and datasets are provided.

Licensing & Compatibility

  • Code: Apache License 2.0
  • Model Weights & Data: NVIDIA Noncommercial License
  • DLC-Bench: CC BY-NC-SA 4.0
  • Commercial Use: Model weights, data, and DLC-Bench may not be used commercially under their noncommercial licenses; only the Apache-2.0 code is free of this restriction.

Limitations & Caveats

Model weights are released under a non-commercial license, which limits their use in commercial products. The project accompanies a 2025 arXiv preprint, so it is a recent research release.

Health Check

  • Last Commit: 2 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 2
  • Star History: 28 stars in the last 30 days

Explore Similar Projects

lens by ContextualAI
  • 353 stars (top 0.3%)
  • Vision-language research paper using LLMs
  • Created 2 years ago; updated 1 month ago
  • Starred by Stas Bekman (author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), Douwe Kiela (cofounder of Contextual AI), and 1 more

CLIP_prefix_caption by rmokady
  • 1k stars (top 0.1%)
  • Image captioning model using CLIP embeddings as a prefix
  • Created 4 years ago; updated 1 year ago
  • Starred by Chip Huyen (author of "AI Engineering", "Designing Machine Learning Systems"), Yaowei Zheng (author of LLaMA-Factory), and 1 more

LAVIS by salesforce
  • 11k stars (top 0.2%)
  • Library for language-vision AI research
  • Created 3 years ago; updated 10 months ago
  • Starred by Chip Huyen (author of "AI Engineering", "Designing Machine Learning Systems"), Simon Willison (coauthor of Django), and 10 more