recognize-anything by xinyu1205

Image tagging models for common/open-set categories and comprehensive captioning

created 2 years ago
3,361 stars

Top 14.8% on sourcepulse

Project Summary

This project provides a suite of open-source image recognition models, including RAM++, RAM, and Tag2Text, designed for high-accuracy image tagging and comprehensive captioning. It targets researchers and developers seeking robust visual semantic analysis capabilities, offering strong zero-shot generalization and the ability to recognize both common and open-set categories.

How It Works

The models use a Swin Transformer backbone and are trained on large-scale datasets such as COCO, Visual Genome (VG), SBU, and CC3M/CC12M. RAM++ and RAM are image tagging models trained with a data engine that generates and cleans annotations; the project reports higher tagging accuracy and stronger zero-shot performance than CLIP and BLIP. Tag2Text feeds tagging information into text generation, which makes its image captions both controllable and more comprehensive.
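
Tagging of this kind is commonly framed as multi-label classification: each candidate tag receives an independent score and is kept when that score clears a threshold. The sketch below illustrates the idea in plain Python. The labels, logits, and thresholds are made-up values, and `select_tags` is a hypothetical helper for illustration, not the project's implementation.

```python
import math


def sigmoid(x):
    """Map a raw logit to a score between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


def select_tags(logits, labels, thresholds):
    """Keep every label whose sigmoid score exceeds its own threshold."""
    tags = []
    for logit, label, threshold in zip(logits, labels, thresholds):
        if sigmoid(logit) > threshold:
            tags.append(label)
    return tags


# Illustrative values only: three candidate tags for one image.
labels = ["dog", "grass", "car"]
logits = [2.0, 0.5, -3.0]
thresholds = [0.7, 0.6, 0.5]

print(" | ".join(select_tags(logits, labels, thresholds)))
```

Because each tag is scored independently, an image can receive any number of tags, which is what distinguishes tagging from single-label classification.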

Quick Start & Requirements

  • Install: pip install git+https://github.com/xinyu1205/recognize-anything.git
  • Prerequisites: Python 3.8+, PyTorch.
  • Setup: Download the pre-trained model checkpoints into a pretrained folder before running the provided inference examples.
  • Docs: Demo, Papers
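
The setup steps above can be sketched as a small script. This is a minimal sketch, assuming the package exposes `ram_plus`, `inference_ram`, and `get_transform` in the way the repository's example scripts use them. The checkpoint filename `ram_plus_swin_large_14m.pth`, the 384-pixel input size, and the helper names `find_checkpoint` and `tag_image` are assumptions for illustration, so check them against the README before relying on them.

```python
from pathlib import Path


def find_checkpoint(folder="pretrained", name="ram_plus_swin_large_14m.pth"):
    """Return the checkpoint path, or raise if it has not been downloaded yet."""
    path = Path(folder) / name
    if not path.is_file():
        raise FileNotFoundError(f"Download the checkpoint to {path} first")
    return path


def tag_image(image_path, checkpoint, image_size=384):
    """Tag one image with RAM++ and return the predicted tags.

    Heavy imports are deferred so find_checkpoint works without the package.
    """
    import torch
    from PIL import Image
    # Assumed API, mirroring the repository's inference example.
    from ram import get_transform, inference_ram
    from ram.models import ram_plus

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    transform = get_transform(image_size=image_size)
    model = ram_plus(pretrained=str(checkpoint), image_size=image_size, vit="swin_l")
    model = model.eval().to(device)

    # Preprocess the image and add a batch dimension.
    image = transform(Image.open(image_path)).unsqueeze(0).to(device)
    with torch.no_grad():
        result = inference_ram(image, model)
    return result[0]  # assumed to hold the English tags
```

With a checkpoint in place, `tag_image("demo.jpg", find_checkpoint())` would return the tags for a hypothetical local image `demo.jpg`.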

Highlighted Details

  • RAM++ is reported to outperform existing state-of-the-art models on common tag categories, uncommon tag categories, and human-object interaction phrases.
  • RAM achieves competitive performance with Google's tagging API and surpasses fully supervised methods in zero-shot scenarios.
  • Tag2Text supports simultaneous tagging and captioning, with tags guiding text generation for enhanced control.
  • The project integrates with Grounding-DINO and SAM for a visual semantic analysis pipeline.

Maintenance & Community

The project acknowledges contributions from various individuals and mentions integration with other projects like Grounded-SAM, Ask-Anything, and Prompt-can-anything.

Licensing & Compatibility

The README does not explicitly state licensing terms, so check the repository for a license file before commercial use or closed-source linking.

Limitations & Caveats

Training and fine-tuning require significant computational resources (e.g., 8 A100 GPUs). Generating custom tag descriptions relies on an OpenAI API key, which incurs usage costs.

Health Check

  • Last commit: 5 months ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 160 stars in the last 90 days
