tokenize-anything by baaivision

Vision model for segmenting, recognizing, and captioning arbitrary regions

Created 1 year ago
594 stars

Top 54.9% on SourcePulse

Project Summary

This project introduces Tokenize Anything via Prompting (TAP), a unified model for segmenting, recognizing, and captioning arbitrary image regions using flexible visual prompts. It targets researchers and developers working on multimodal vision-language tasks, offering a versatile tool for detailed image analysis.

How It Works

TAP employs a modular design whose components are decoupled for flexibility: the model is trained on the exhaustive segmentation masks of SA-1B, while semantic priors are injected from a large-scale pre-trained EVA-CLIP model (5 billion parameters). The result is a single model that simultaneously segments, recognizes, and captions user-specified regions prompted with points, boxes, or sketches.
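The decoupled design can be pictured as "encode once, prompt many times": the heavy image encoder runs a single pass, and each visual prompt triggers only a lightweight decode that yields a mask plus a semantic token. The toy NumPy sketch below illustrates that flow; encode_image and prompt_region are illustrative stand-ins, not the project's actual API.

```python
import numpy as np

def encode_image(image):
    """Stand-in for the ViT image encoder: runs once per image."""
    return image.mean(axis=-1)  # (H, W) placeholder feature map

def prompt_region(features, point):
    """Stand-in for the prompt decoder: one mask + one semantic token per prompt."""
    y, x = point
    mask = np.zeros(features.shape, dtype=bool)
    mask[max(y - 8, 0):y + 8, max(x - 8, 0):x + 8] = True  # fake local mask
    semantic_token = features[mask].mean(keepdims=True)    # fake region embedding
    return mask, semantic_token

image = np.random.rand(256, 256, 3).astype("float32")
features = encode_image(image)            # heavy encoder runs once
for pt in [(64, 64), (128, 200)]:         # each prompt is a cheap decode
    mask, token = prompt_region(features, pt)
    # In TAP, the semantic token is aligned with EVA-CLIP's embedding space,
    # so the same token drives open-vocabulary recognition and captioning.
```

This is why one forward pass over an image can serve many interactive prompts at low marginal cost.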

Quick Start & Requirements

  • Installation: pip install git+ssh://git@github.com/baaivision/tokenize-anything.git, or clone the repository and run pip install .
  • Prerequisites: torch >= 2.1; flash-attn >= 2.3.3 (needed for text generation). The Gradio app additionally requires gradio-image-prompter. A quick version check follows this list.
  • Resources: Three model versions are available (ViT-H, ViT-L, ViT-B) with varying performance and resource needs. Weights are hosted on Hugging Face.
  • Links: Paper, Demo, Inference Guide, Concept Guide
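Before installing, it can help to confirm the environment meets the version pins above. A minimal check using only the standard library (it reports installed versions; it does not compare or install anything):

```python
# Report installed versions of the prerequisites listed above; flash-attn is
# only needed for text generation, gradio-image-prompter only for the demo app.
from importlib.metadata import PackageNotFoundError, version

for pkg, minimum in [("torch", "2.1"), ("flash-attn", "2.3.3")]:
    try:
        print(f"{pkg} {version(pkg)} (need >= {minimum})")
    except PackageNotFoundError:
        print(f"{pkg} not installed (need >= {minimum})")
```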

Highlighted Details

  • Unified model for segmentation, recognition, and captioning.
  • Supports flexible visual prompts: points, boxes, and sketches (see the prompt sketch after this list).
  • Trained on SA-1B dataset and uses a 5B parameter EVA-CLIP model.
  • Offers multiple model sizes (ViT-H, ViT-L, ViT-B) for different use cases.
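To make "flexible visual prompts" concrete, the sketch below builds a point prompt and a box prompt as coordinate arrays. The label convention shown (1 marks a foreground click, 2 and 3 mark a box's two corners, 4 is padding) is the SAM-style prompt encoding TAP inherits; treat the exact values as an assumption and verify them against the Inference Guide.

```python
import numpy as np

# One foreground click at (x=480, y=320); the second slot is padding.
# Label convention (assumed, SAM-style): 1 = foreground point, 4 = padding.
point_prompt = np.array([[[480, 320, 1],
                          [  0,   0, 4]]], dtype="float32")  # (batch, n_points, 3)

# A box given as its two corners, labeled 2 (top-left) and 3 (bottom-right).
box_prompt = np.array([[[128,  96, 2],
                        [640, 512, 3]]], dtype="float32")
```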

Maintenance & Community

The project is associated with BAAI and ICT-CAS. Acknowledgements mention contributions from SAM, EVA, LLaMA, FlashAttention, Gradio, and Detectron2.

Licensing & Compatibility

  • License: Apache License 2.0.
  • Compatibility: Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

The README notes that training schedules and dataset usage differ between the V1.0 and V1.1 model releases, so checkpoints and reported results may not be directly comparable or interchangeable across versions.

Health Check

  • Last Commit: 9 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 3 stars in the last 30 days

Explore Similar Projects

Starred by Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), Douwe Kiela (Cofounder of Contextual AI), and 1 more.

lens by ContextualAI

Vision-language research paper using LLMs

353 stars · Top 0.3% on SourcePulse
Created 2 years ago · Updated 1 month ago