recognize-anything by xinyu1205

Image tagging models for common/open-set categories and comprehensive captioning

Created 2 years ago
3,414 stars

Top 14.1% on SourcePulse

View on GitHub
Project Summary

This project provides a suite of open-source image recognition models, including RAM++, RAM, and Tag2Text, designed for high-accuracy image tagging and comprehensive captioning. It targets researchers and developers seeking robust visual semantic analysis capabilities, offering strong zero-shot generalization and the ability to recognize both common and open-set categories.

How It Works

The models leverage a Swin Transformer backbone and are trained on large-scale datasets like COCO, VG, SBU, and CC3M/CC12M. RAM++ and RAM excel at image tagging by utilizing a data engine for annotation generation and cleaning, achieving superior accuracy and zero-shot performance compared to models like CLIP and BLIP. Tag2Text integrates tagging information into text generation, enabling controllable and comprehensive image captioning.
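
As an illustration of tag-guided generation, the sketch below shows how a Tag2Text inference call might be driven with user-specified tags. This is a minimal sketch only: the import paths (ram.models.tag2text, ram.get_transform, ram.inference_tag2text), the constructor arguments, and the checkpoint filename are assumptions and should be checked against the repo's inference scripts.

    # Minimal sketch of tag-guided captioning with Tag2Text (names assumed, verify against the repo).
    import torch
    from PIL import Image
    from ram import get_transform, inference_tag2text   # assumed package helpers
    from ram.models import tag2text                      # assumed model constructor

    device = "cuda" if torch.cuda.is_available() else "cpu"
    transform = get_transform(image_size=384)

    # Checkpoint must first be downloaded into ./pretrained (placeholder filename).
    model = tag2text(pretrained="pretrained/tag2text_checkpoint.pth",
                     image_size=384, vit="swin_b").eval().to(device)

    image = transform(Image.open("demo.jpg")).unsqueeze(0).to(device)

    # Supplying a tag string steers the generated caption toward those concepts.
    tags, user_tags, caption = inference_tag2text(image, model, input_tag="dog,beach")
    print("predicted tags:", tags)
    print("caption:", caption)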

Quick Start & Requirements

  • Install: pip install git+https://github.com/xinyu1205/recognize-anything.git
  • Prerequisites: Python 3.8+, PyTorch.
  • Setup: Download the pre-trained model checkpoints into a pretrained folder. Inference examples are provided (see the sketch below).
  • Docs: Demo, Papers
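
For a quick end-to-end check, zero-shot tagging with RAM++ might look like the sketch below. The import paths (ram.models.ram_plus, ram.inference_ram), the constructor arguments, and the checkpoint filename are assumptions; consult the repo's inference examples for the exact API.

    # Minimal sketch of zero-shot image tagging with RAM++ (import paths and filename assumed).
    import torch
    from PIL import Image
    from ram import get_transform, inference_ram   # assumed package helpers
    from ram.models import ram_plus                # assumed model constructor

    device = "cuda" if torch.cuda.is_available() else "cpu"
    transform = get_transform(image_size=384)

    # Checkpoint downloaded into ./pretrained as described above (placeholder filename).
    model = ram_plus(pretrained="pretrained/ram_plus_checkpoint.pth",
                     image_size=384, vit="swin_l").eval().to(device)

    image = transform(Image.open("demo.jpg")).unsqueeze(0).to(device)

    english_tags, chinese_tags = inference_ram(image, model)
    print("tags:", english_tags)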

Highlighted Details

  • RAM++ outperforms SOTA models on common, uncommon, and human-object interaction phrases.
  • RAM achieves competitive performance with Google's tagging API and surpasses fully supervised methods in zero-shot scenarios.
  • Tag2Text supports simultaneous tagging and captioning, with tags guiding text generation for enhanced control.
  • The project integrates with Grounding-DINO and SAM for a visual semantic analysis pipeline (see the pipeline sketch below).
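
The Grounding-DINO/SAM integration typically chains RAM's open-set tags into Grounding-DINO as a text prompt and then feeds the resulting boxes to SAM for masks. The sketch below is purely illustrative: detect_with_grounding_dino and segment_with_sam are hypothetical wrapper functions standing in for the real model calls, not APIs shipped by this repo.

    # Illustrative tags -> boxes -> masks pipeline; the two wrappers are hypothetical stand-ins.
    from PIL import Image

    def detect_with_grounding_dino(image, text_prompt):
        # A real implementation would run Grounding-DINO with the tag phrases as the prompt
        # and return one bounding box per grounded phrase. A dummy box is returned here.
        return [(50, 40, 300, 280)]

    def segment_with_sam(image, boxes):
        # A real implementation would run SAM's box-prompted predictor on each box
        # and return one binary mask per box. Dummy placeholders are returned here.
        return [None for _ in boxes]

    image = Image.open("demo.jpg")

    # 1. RAM / RAM++ produces open-set tags (see the tagging sketch above).
    tags = "dog . frisbee . beach"

    # 2. Grounding-DINO grounds each tag phrase to bounding boxes.
    boxes = detect_with_grounding_dino(image, text_prompt=tags)

    # 3. SAM converts the boxes into pixel-level segmentation masks.
    masks = segment_with_sam(image, boxes)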

Maintenance & Community

The project acknowledges contributions from various individuals and mentions integration with other projects like Grounded-SAM, Ask-Anything, and Prompt-can-anything.

Licensing & Compatibility

The README does not explicitly state a license. The absence of clear licensing terms may impact commercial use or closed-source linking.

Limitations & Caveats

Training and fine-tuning require significant computational resources (e.g., 8 A100 GPUs). Generating descriptions for custom tags relies on an OpenAI API key, which incurs costs. As noted above, the license is not explicitly stated, so commercial applications require further investigation.
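
On the OpenAI dependency: generating descriptions for custom open-set tags is essentially a batch of LLM calls. The sketch below shows one way this could look using the official openai Python client; the prompt wording, model name, and output format are assumptions, not the repo's actual script.

    # Hypothetical sketch of generating tag descriptions with the OpenAI API (requires OPENAI_API_KEY).
    import json
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    custom_tags = ["corgi", "surfboard"]

    descriptions = {}
    for tag in custom_tags:
        # Ask the LLM for short visual descriptions of the tag (prompt wording is illustrative).
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user",
                       "content": f"Give three short visual descriptions of a '{tag}' in an image."}],
        )
        descriptions[tag] = response.choices[0].message.content

    # Save for later use as open-set tag descriptions (filename is a placeholder).
    with open("custom_tag_descriptions.json", "w") as f:
        json.dump(descriptions, f, indent=2)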

Health Check

  • Last Commit: 7 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 38 stars in the last 30 days

Explore Similar Projects

Starred by Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), Douwe Kiela (Cofounder of Contextual AI), and 1 more.

lens by ContextualAI
0.3% · 353 stars
Vision-language research paper using LLMs
Created 2 years ago · Updated 1 month ago

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Yaowei Zheng (Author of LLaMA-Factory), and 1 more.

CLIP_prefix_caption by rmokady
0.1% · 1k stars
Image captioning model using CLIP embeddings as a prefix
Created 4 years ago · Updated 1 year ago

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Simon Willison (Coauthor of Django), and 10 more.

LAVIS by salesforce
0.2% · 11k stars
Library for language-vision AI research
Created 3 years ago · Updated 10 months ago