recognize-anything by xinyu1205

Image tagging models for common/open-set categories and comprehensive captioning

Created 2 years ago
3,414 stars

Top 14.1% on SourcePulse

View on GitHub
Project Summary

This project provides a suite of open-source image recognition models, including RAM++, RAM, and Tag2Text, designed for high-accuracy image tagging and comprehensive captioning. It targets researchers and developers seeking robust visual semantic analysis capabilities, offering strong zero-shot generalization and the ability to recognize both common and open-set categories.

How It Works

The models leverage a Swin Transformer backbone and are trained on large-scale datasets like COCO, VG, SBU, and CC3M/CC12M. RAM++ and RAM excel at image tagging by utilizing a data engine for annotation generation and cleaning, achieving superior accuracy and zero-shot performance compared to models like CLIP and BLIP. Tag2Text integrates tagging information into text generation, enabling controllable and comprehensive image captioning.
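
As an illustration of tag-guided generation, the sketch below shows how a Tag2Text inference call might be driven with user-specified tags. This is a minimal sketch only: the import paths (ram.models.tag2text, ram.get_transform, ram.inference_tag2text), the constructor arguments, and the checkpoint filename are assumptions and should be checked against the repo's inference scripts.

    # Minimal sketch of tag-guided captioning with Tag2Text (names assumed, verify against the repo).
    import torch
    from PIL import Image
    from ram import get_transform, inference_tag2text   # assumed package helpers
    from ram.models import tag2text                      # assumed model constructor

    device = "cuda" if torch.cuda.is_available() else "cpu"
    transform = get_transform(image_size=384)

    # Checkpoint must first be downloaded into ./pretrained (placeholder filename).
    model = tag2text(pretrained="pretrained/tag2text_checkpoint.pth",
                     image_size=384, vit="swin_b").eval().to(device)

    image = transform(Image.open("demo.jpg")).unsqueeze(0).to(device)

    # Supplying a tag string steers the generated caption toward those concepts.
    tags, user_tags, caption = inference_tag2text(image, model, input_tag="dog,beach")
    print("predicted tags:", tags)
    print("caption:", caption)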

Quick Start & Requirements

  • Install: pip install git+https://github.com/xinyu1205/recognize-anything.git
  • Prerequisites: Python 3.8+, PyTorch.
  • Setup: Download the pre-trained model checkpoints into a pretrained folder. Inference examples are provided (see the sketch below).
  • Docs: Demo, Papers
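
For a quick end-to-end check, zero-shot tagging with RAM++ might look like the sketch below. The import paths (ram.models.ram_plus, ram.inference_ram), the constructor arguments, and the checkpoint filename are assumptions; consult the repo's inference examples for the exact API.

    # Minimal sketch of zero-shot image tagging with RAM++ (import paths and filename assumed).
    import torch
    from PIL import Image
    from ram import get_transform, inference_ram   # assumed package helpers
    from ram.models import ram_plus                # assumed model constructor

    device = "cuda" if torch.cuda.is_available() else "cpu"
    transform = get_transform(image_size=384)

    # Checkpoint downloaded into ./pretrained as described above (placeholder filename).
    model = ram_plus(pretrained="pretrained/ram_plus_checkpoint.pth",
                     image_size=384, vit="swin_l").eval().to(device)

    image = transform(Image.open("demo.jpg")).unsqueeze(0).to(device)

    english_tags, chinese_tags = inference_ram(image, model)
    print("tags:", english_tags)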

Highlighted Details

  • RAM++ outperforms SOTA models on common, uncommon, and human-object interaction phrases.
  • RAM achieves competitive performance with Google's tagging API and surpasses fully supervised methods in zero-shot scenarios.
  • Tag2Text supports simultaneous tagging and captioning, with tags guiding text generation for enhanced control.
  • The project integrates with Grounding-DINO and SAM for a visual semantic analysis pipeline (see the pipeline sketch below).
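
The Grounding-DINO/SAM integration typically chains RAM's open-set tags into Grounding-DINO as a text prompt and then feeds the resulting boxes to SAM for masks. The sketch below is purely illustrative: detect_with_grounding_dino and segment_with_sam are hypothetical wrapper functions standing in for the real model calls, not APIs shipped by this repo.

    # Illustrative tags -> boxes -> masks pipeline; the two wrappers are hypothetical stand-ins.
    from PIL import Image

    def detect_with_grounding_dino(image, text_prompt):
        # A real implementation would run Grounding-DINO with the tag phrases as the prompt
        # and return one bounding box per grounded phrase. A dummy box is returned here.
        return [(50, 40, 300, 280)]

    def segment_with_sam(image, boxes):
        # A real implementation would run SAM's box-prompted predictor on each box
        # and return one binary mask per box. Dummy placeholders are returned here.
        return [None for _ in boxes]

    image = Image.open("demo.jpg")

    # 1. RAM / RAM++ produces open-set tags (see the tagging sketch above).
    tags = "dog . frisbee . beach"

    # 2. Grounding-DINO grounds each tag phrase to bounding boxes.
    boxes = detect_with_grounding_dino(image, text_prompt=tags)

    # 3. SAM converts the boxes into pixel-level segmentation masks.
    masks = segment_with_sam(image, boxes)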

Maintenance & Community

The project acknowledges contributions from various individuals and mentions integration with other projects like Grounded-SAM, Ask-Anything, and Prompt-can-anything.

Licensing & Compatibility

The README does not explicitly state a license. The absence of clear licensing terms may impact commercial use or closed-source linking.

Limitations & Caveats

Training and fine-tuning require significant computational resources (e.g., 8 A100 GPUs). Generating descriptions for custom tags relies on an OpenAI API key, which incurs costs. As noted above, the license is not explicitly stated, so commercial applications require further investigation.
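
On the OpenAI dependency: generating descriptions for custom open-set tags is essentially a batch of LLM calls. The sketch below shows one way this could look using the official openai Python client; the prompt wording, model name, and output format are assumptions, not the repo's actual script.

    # Hypothetical sketch of generating tag descriptions with the OpenAI API (requires OPENAI_API_KEY).
    import json
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    custom_tags = ["corgi", "surfboard"]

    descriptions = {}
    for tag in custom_tags:
        # Ask the LLM for short visual descriptions of the tag (prompt wording is illustrative).
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user",
                       "content": f"Give three short visual descriptions of a '{tag}' in an image."}],
        )
        descriptions[tag] = response.choices[0].message.content

    # Save for later use as open-set tag descriptions (filename is a placeholder).
    with open("custom_tag_descriptions.json", "w") as f:
        json.dump(descriptions, f, indent=2)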

Health Check

  • Last Commit: 7 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 38 stars in the last 30 days

Explore Similar Projects

Starred by Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), Douwe Kiela (Cofounder of Contextual AI), and 1 more.

lens by ContextualAI
0.3% · 353 stars
Vision-language research paper using LLMs
Created 2 years ago · Updated 1 month ago

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Yaowei Zheng (Author of LLaMA-Factory), and 1 more.

CLIP_prefix_caption by rmokady
0.1% · 1k stars
Image captioning model using CLIP embeddings as a prefix
Created 4 years ago · Updated 1 year ago

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Simon Willison (Coauthor of Django), and 10 more.

LAVIS by salesforce
0.2% · 11k stars
Library for language-vision AI research
Created 3 years ago · Updated 10 months ago