Caption-Anything by ttengwang

Image processing tool for generating tailored captions

Created 2 years ago
1,761 stars

Top 24.3% on SourcePulse

View on GitHub
Project Summary

Caption-Anything is an interactive image captioning tool that leverages object segmentation, visual grounding, and large language models (LLMs) to generate descriptive text for specific image regions. It targets users who need fine-grained control over caption generation, offering both visual (e.g., click-based selection) and linguistic (e.g., length, sentiment) parameters for tailored output.

How It Works

The system integrates Segment Anything (SAM) for precise object segmentation, a visual captioning model (such as BLIP-2) to generate initial descriptions, and ChatGPT for conversational refinement and style control. Users interactively select objects with mouse clicks, then prompt the LLM to adjust a caption's sentiment, length, or language, enabling detailed object-specific discussion.
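The three-stage flow described above can be sketched as a minimal Python pipeline. The functions below are stubs standing in for the real models (SAM for segmentation, BLIP-2 for captioning, ChatGPT for refinement); the function names, signatures, and sample strings are illustrative assumptions, not the project's actual API.

```python
# Minimal sketch of the Caption-Anything pipeline, with each model stubbed out.

def segment(image, click_xy):
    """Stub for SAM: return a mask for the object at the clicked point."""
    return {"mask": "object-mask", "point": click_xy}

def caption(image, mask):
    """Stub for BLIP-2: produce a raw description of the masked region."""
    return "a dog lying on the grass"

def refine(raw_caption, sentiment="positive", length="short", language="en"):
    """Stub for the LLM step: restyle the caption per the user's controls."""
    return f"[{sentiment}/{length}/{language}] {raw_caption}"

image = "photo.jpg"                # placeholder input image
mask = segment(image, (120, 80))   # user clicks an object in the image
raw = caption(image, mask)         # raw caption for the selected region
final = refine(raw, sentiment="positive", length="short")
print(final)  # [positive/short/en] a dog lying on the grass
```

The key design point is that the visual controls (click, mask) and the language controls (sentiment, length, language) are decoupled: segmentation fixes *what* is described, and the LLM step fixes *how* it is described.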

Quick Start & Requirements

  • Install via pip install -r requirements.txt.
  • Requires Python >= 3.8.1.
  • Needs an OpenAI API key.
  • GPU memory requirements vary: huge segmenter + blip2 captioner requires ~13GB; base segmenter + blip2 requires ~8.5GB; base segmenter + blip requires ~5.5GB.
  • Pre-downloaded SAM checkpoints are recommended for specific configurations.
  • Demo: https://huggingface.co/spaces/TencentARC/Caption-Anything
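The setup steps above can be sketched as a shell session. This is a hedged sketch, not verbatim README instructions: the repository URL and the `app.py` entry point are assumptions, while the checkpoint URL is the standard SAM ViT-H ("huge") checkpoint from the segment-anything project.

```shell
# Clone the repository and install dependencies (Python >= 3.8.1)
git clone https://github.com/ttengwang/Caption-Anything.git
cd Caption-Anything
pip install -r requirements.txt

# The refinement step calls the OpenAI API
export OPENAI_API_KEY=sk-...   # your key here

# Optional: pre-download a SAM checkpoint (ViT-H "huge" variant, ~13GB GPU with blip2)
wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth

# Launch the interactive demo (entry point assumed)
python app.py
```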

Highlighted Details

  • Interactive object selection via mouse clicks.
  • Language controls for caption length, sentiment, and language.
  • Supports "chatting" about selected objects for deeper understanding.
  • Integrates with LangChain and Visual Question Answering (VQA).
  • Recent updates include support for captioning paragraphs and mouse trajectory as a visual control.

Maintenance & Community

The project is primarily driven by Teng Wang and contributors from various institutions including Southern University of Science and Technology, HKU, and Tencent ARC Lab.

Licensing & Compatibility

The repository does not explicitly state a license in the provided README.

Limitations & Caveats

The project relies on external APIs (OpenAI) which may incur costs. GPU memory requirements can be substantial for higher-fidelity models. The "mouse trajectory" control is noted as beta.

Health Check

  • Last Commit: 2 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 8 stars in the last 30 days

Explore Similar Projects

Starred by Jason Knight (Director AI Compilers at NVIDIA; Cofounder of OctoML), Travis Fischer (Founder of Agentic), and 5 more.

fromage by kohjingyu

482 stars
Multimodal model for grounding language models to images
Created 2 years ago
Updated 1 year ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Yaowei Zheng (Author of LLaMA-Factory), and 1 more.

CLIP_prefix_caption by rmokady

1k stars
Image captioning model using CLIP embeddings as a prefix
Created 4 years ago
Updated 1 year ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Elvis Saravia (Founder of DAIR.AI), and 1 more.

InternGPT by OpenGVLab

3k stars
Interactive demo platform for showcasing AI models
Created 2 years ago
Updated 1 year ago