FG-CLIP by 360CVGroup

Fine-grained text-image alignment model for enhanced discrimination

Created 5 months ago
317 stars

Top 85.1% on SourcePulse

View on GitHub
Project Summary

FG-CLIP is a novel text-image cross-modal model designed for fine-grained discrimination and embedding. It targets researchers and developers seeking enhanced accuracy in tasks requiring precise visual and textual understanding, offering superior performance over standard CLIP models.

How It Works

FG-CLIP employs a two-stage training strategy. The initial stage aligns global image-caption pairs to establish a foundational understanding. The subsequent stage refines this alignment with region-level supervision: detailed captions for specific image regions, paired with hard negative descriptions the model must distinguish from the positives. This approach enables more granular discrimination and richer feature embeddings.
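To make the two-stage idea concrete, here is a minimal PyTorch sketch of a stage-2-style objective. This is not the official training code: the loss weighting and the 0.2 margin are illustrative assumptions, and the function names are invented for this sketch.

    # Minimal sketch of the two-stage alignment idea (not the official training code).
    # Stage 1 uses only the global contrastive term; stage 2 adds a region-level
    # term over region embeddings and their captions, plus a hard-negative penalty.
    import torch
    import torch.nn.functional as F

    def contrastive_loss(img_emb, txt_emb, temperature=0.07):
        """Standard symmetric InfoNCE loss over a batch of paired embeddings."""
        img_emb = F.normalize(img_emb, dim=-1)
        txt_emb = F.normalize(txt_emb, dim=-1)
        logits = img_emb @ txt_emb.t() / temperature
        targets = torch.arange(len(logits), device=logits.device)
        return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

    def stage2_loss(global_img, global_txt, region_img, region_txt, region_neg_txt):
        """Illustrative stage-2 objective: keep the global term, add a region-level
        term, and push region embeddings away from hard negative captions."""
        loss = contrastive_loss(global_img, global_txt)
        loss = loss + contrastive_loss(region_img, region_txt)
        # Hard-negative term: each region should score its true caption above
        # its paired hard negative caption by a margin (0.2 is illustrative).
        pos = F.cosine_similarity(region_img, region_txt, dim=-1)
        neg = F.cosine_similarity(region_img, region_neg_txt, dim=-1)
        return loss + F.relu(neg - pos + 0.2).mean()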

Quick Start & Requirements

  • Installation: Requires Python 3.10. Create and activate a Conda environment with conda create -n FGCLIP python=3.10 -y and conda activate FGCLIP, then run pip install -e . from the FG-CLIP directory.
  • Prerequisites: A CUDA-capable GPU is assumed; the released example code moves models to the GPU with .cuda() calls.
  • Models: Pre-trained checkpoints are available on Hugging Face: 🤗ViT-B@224px and 🤗ViT-L@336px (see the loading sketch after this list).
  • Demos & API: Two demos for fine-grained retrieval and dense feature display are available. API access for FG-CLIP v2 is provided at research.360.cn.
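As a quick sanity check after installation, the following sketch loads one of the Hugging Face checkpoints and scores captions against an image. The repo id qihoo360/fg-clip-base and the get_image_features / get_text_features method names are assumptions based on a CLIP-style interface exposed via trust_remote_code; consult the model card for the exact API.

    # Hedged sketch of zero-shot image-text matching with a Hugging Face checkpoint.
    import torch
    from PIL import Image
    from transformers import AutoModel, AutoTokenizer, AutoImageProcessor

    model_id = "qihoo360/fg-clip-base"  # assumed repo id; see the 🤗 links above
    model = AutoModel.from_pretrained(model_id, trust_remote_code=True).cuda().eval()
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    processor = AutoImageProcessor.from_pretrained(model_id)

    image = Image.open("cat.jpg").convert("RGB")
    captions = ["a photo of a black cat", "a photo of a white dog"]

    with torch.no_grad():
        pixel_values = processor(images=image, return_tensors="pt").pixel_values.cuda()
        text_inputs = tokenizer(captions, padding=True, return_tensors="pt").to("cuda")
        # Assumed CLIP-style feature extractors exposed by the remote code:
        img_feat = model.get_image_features(pixel_values)
        txt_feat = model.get_text_features(**text_inputs)
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
        # Cosine-similarity scores per caption (no logit scale applied; illustrative).
        print((img_feat @ txt_feat.t()).softmax(dim=-1))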

Highlighted Details

  • Achieves fine-grained visual and textual alignment, excelling in discrimination tasks.
  • Official implementation of the FG-CLIP paper, accepted at ICML'25.
  • Introduces the FineHARD dataset, curated with detailed region captions and challenging negative samples.
  • Models are readily available on Hugging Face for easy integration.
  • FG-CLIP v2 offers API access and improved performance.

Maintenance & Community

The project is actively recruiting academic interns in the multimodal field; interested candidates can contact xiechunyu@360.cn. No community channels (e.g., Discord, Slack) are listed.

Licensing & Compatibility

The core project is licensed under the Apache License 2.0. However, users must comply with the respective original licenses for any datasets and checkpoints utilized. The Apache 2.0 license is generally permissive for commercial use and integration into closed-source projects.

Limitations & Caveats

The training script warns that fp16 precision can produce NaN gradients and recommends DeepSpeed ZeRO-2, tf32, or bf16 instead (see the sketch below). While v1 is open-sourced, v2 is currently available only via API, suggesting potential differences in features or performance. The project's ties to a recently accepted conference paper (ICML'25) indicate a research focus, so production-readiness should be independently assessed.
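Below is a minimal, generic PyTorch sketch of the recommended precision settings (bf16 autocast plus tf32 matmuls). It is not the project's launch script, and the model and data are stand-ins.

    # Avoiding fp16 gradient NaNs: train in bf16 with tf32 matmuls enabled.
    import torch

    torch.backends.cuda.matmul.allow_tf32 = True   # tf32 matmuls on Ampere+ GPUs
    torch.backends.cudnn.allow_tf32 = True

    model = torch.nn.Linear(512, 512).cuda()       # stand-in for the real model
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    x = torch.randn(8, 512, device="cuda")
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = model(x).pow(2).mean()              # bf16 forward; no GradScaler needed
    loss.backward()
    optimizer.step()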

Health Check

  • Last Commit: 12 hours ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 25 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Simon Willison (co-creator of Django), and 10 more.

LAVIS by salesforce

  • Top 0.1% · 11k stars
  • Library for language-vision AI research
  • Created 3 years ago · Updated 11 months ago