describe-anything by NVlabs

Image/video captioning model for detailed localized descriptions

Created 5 months ago
1,334 stars

Top 30.1% on SourcePulse

Project Summary

This repository provides the "Describe Anything" model (DAM), a system for generating detailed, localized descriptions of regions within images and videos. It's designed for researchers and developers working on advanced computer vision and natural language processing tasks, offering precise captioning for user-specified areas.

How It Works

DAM leverages a foundation model that takes image or video regions, specified by points, boxes, or masks, and outputs detailed textual descriptions. For videos, annotating a single frame is sufficient; the model handles temporal propagation across the remaining frames. This enables highly specific, context-aware captioning that goes beyond global scene descriptions.
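The region-to-description contract can be pictured as follows. This is a minimal, hypothetical sketch: the model class and method names are illustrative assumptions, not the repository's exact API; only the region formats (points, boxes, masks) come from the project description.

    # Hypothetical sketch of DAM's region-in, description-out contract.
    # DescribeAnythingModel and describe() are placeholder names, not
    # the repository's actual API.
    import numpy as np
    from PIL import Image

    image = Image.open("street.jpg").convert("RGB")

    # A region can arrive as points, a box, or a binary mask; boxes and
    # points are easily lifted into the mask representation.
    box = (120, 80, 340, 300)  # (x1, y1, x2, y2) in pixel coordinates
    mask = np.zeros((image.height, image.width), dtype=bool)
    mask[box[1]:box[3], box[0]:box[2]] = True  # fill the box region

    # Hypothetical model call (names are placeholders):
    # model = DescribeAnythingModel.from_pretrained("nvidia/DAM-3B")
    # print(model.describe(image, mask=mask))  # caption of just that region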

Quick Start & Requirements

  • Installation: pip install git+https://github.com/NVlabs/describe-anything
  • Dependencies: Python, Gradio (for demos), SAM/SAM2 (for segmentation). Pretrained model weights are hosted on HuggingFace.
  • Demos: Includes interactive Gradio demos for image and video captioning, with optional integration of SAM for automated mask generation.
  • Resources: A self-contained script is available for image descriptions without full package installation.
  • API: An OpenAI-compatible API server (dam_server.py) is provided for integration; see the sketch after this list.
  • Links: Paper, Project Page, HuggingFace Demo, Model/Benchmark/Datasets
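Because the server speaks the OpenAI protocol, a standard OpenAI client can talk to it. The sketch below assumes a local server on port 8000, a placeholder model id, and a plain image-plus-prompt message; the exact port, model name, and message format should be taken from the repository's documentation.

    # Hedged sketch: querying dam_server.py through the standard OpenAI
    # client. The base_url, model id, and message format are assumptions;
    # only "OpenAI-compatible" comes from the project description.
    import base64
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

    with open("street.jpg", "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()

    response = client.chat.completions.create(
        model="describe-anything",  # placeholder model id
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Describe the highlighted region in detail."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    print(response.choices[0].message.content)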

Highlighted Details

  • Supports detailed localized captioning for both images and videos.
  • Introduces DLC-Bench, a new benchmark for evaluating Detailed Localized Captioning models.
  • Offers an OpenAI-compatible API for seamless integration into existing workflows.
  • Includes example scripts for command-line usage, interactive demos, and API interaction.

Maintenance & Community

Developed by NVlabs, UC Berkeley, and UCSF. Links to HuggingFace for models and datasets are provided.

Licensing & Compatibility

  • Code: Apache License 2.0
  • Model Weights & Data: NVIDIA Noncommercial License
  • DLC-Bench: CC BY-NC-SA 4.0
  • Commercial Use: Model weights, data, and DLC-Bench may not be used commercially under their noncommercial licenses; only the Apache-2.0 code is free of this restriction.

Limitations & Caveats

Model weights are released under a non-commercial license, which limits their use in commercial products. The project accompanies a 2025 arXiv preprint, so it is a recent research release.

Health Check

  • Last Commit: 2 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 2
  • Star History: 28 stars in the last 30 days

Explore Similar Projects

lens by ContextualAI
  • 353 stars (top 0.3%)
  • Vision-language research paper using LLMs
  • Created 2 years ago; updated 1 month ago
  • Starred by Stas Bekman (author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), Douwe Kiela (cofounder of Contextual AI), and 1 more

CLIP_prefix_caption by rmokady
  • 1k stars (top 0.1%)
  • Image captioning model using CLIP embeddings as a prefix
  • Created 4 years ago; updated 1 year ago
  • Starred by Chip Huyen (author of "AI Engineering", "Designing Machine Learning Systems"), Yaowei Zheng (author of LLaMA-Factory), and 1 more

LAVIS by salesforce
  • 11k stars (top 0.2%)
  • Library for language-vision AI research
  • Created 3 years ago; updated 10 months ago
  • Starred by Chip Huyen (author of "AI Engineering", "Designing Machine Learning Systems"), Simon Willison (coauthor of Django), and 10 more