FG-CLIP by 360CVGroup

Fine-grained text-image alignment model for enhanced discrimination

Created 5 months ago
317 stars

Top 85.1% on SourcePulse

View on GitHub
Project Summary

FG-CLIP is a novel text-image cross-modal model designed for fine-grained discrimination and embedding. It targets researchers and developers seeking enhanced accuracy in tasks requiring precise visual and textual understanding, offering superior performance over standard CLIP models.

How It Works

FG-CLIP employs a two-stage training strategy. The initial stage aligns global image-caption pairs to establish a foundational understanding. The subsequent stage refines this alignment with region-level supervision: detailed captions for specific image regions, paired with hard negative descriptions the model must distinguish from the positives. This approach enables more granular discrimination and richer feature embeddings.
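To make the two-stage idea concrete, here is a minimal PyTorch sketch of a stage-2-style objective. This is not the official training code: the loss weighting and the 0.2 margin are illustrative assumptions, and the function names are invented for this sketch.

    # Minimal sketch of the two-stage alignment idea (not the official training code).
    # Stage 1 uses only the global contrastive term; stage 2 adds a region-level
    # term over region embeddings and their captions, plus a hard-negative penalty.
    import torch
    import torch.nn.functional as F

    def contrastive_loss(img_emb, txt_emb, temperature=0.07):
        """Standard symmetric InfoNCE loss over a batch of paired embeddings."""
        img_emb = F.normalize(img_emb, dim=-1)
        txt_emb = F.normalize(txt_emb, dim=-1)
        logits = img_emb @ txt_emb.t() / temperature
        targets = torch.arange(len(logits), device=logits.device)
        return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

    def stage2_loss(global_img, global_txt, region_img, region_txt, region_neg_txt):
        """Illustrative stage-2 objective: keep the global term, add a region-level
        term, and push region embeddings away from hard negative captions."""
        loss = contrastive_loss(global_img, global_txt)
        loss = loss + contrastive_loss(region_img, region_txt)
        # Hard-negative term: each region should score its true caption above
        # its paired hard negative caption by a margin (0.2 is illustrative).
        pos = F.cosine_similarity(region_img, region_txt, dim=-1)
        neg = F.cosine_similarity(region_img, region_neg_txt, dim=-1)
        return loss + F.relu(neg - pos + 0.2).mean()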

Quick Start & Requirements

  • Installation: Requires Python 3.10. Create and activate a Conda environment with conda create -n FGCLIP python=3.10 -y and conda activate FGCLIP, then run pip install -e . from the FG-CLIP directory.
  • Prerequisites: A CUDA-capable GPU is assumed; the released example code moves models to the GPU with .cuda() calls.
  • Models: Pre-trained checkpoints are available on Hugging Face: 🤗ViT-B@224px and 🤗ViT-L@336px (see the loading sketch after this list).
  • Demos & API: Two demos for fine-grained retrieval and dense feature display are available. API access for FG-CLIP v2 is provided at research.360.cn.
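As a quick sanity check after installation, the following sketch loads one of the Hugging Face checkpoints and scores captions against an image. The repo id qihoo360/fg-clip-base and the get_image_features / get_text_features method names are assumptions based on a CLIP-style interface exposed via trust_remote_code; consult the model card for the exact API.

    # Hedged sketch of zero-shot image-text matching with a Hugging Face checkpoint.
    import torch
    from PIL import Image
    from transformers import AutoModel, AutoTokenizer, AutoImageProcessor

    model_id = "qihoo360/fg-clip-base"  # assumed repo id; see the 🤗 links above
    model = AutoModel.from_pretrained(model_id, trust_remote_code=True).cuda().eval()
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    processor = AutoImageProcessor.from_pretrained(model_id)

    image = Image.open("cat.jpg").convert("RGB")
    captions = ["a photo of a black cat", "a photo of a white dog"]

    with torch.no_grad():
        pixel_values = processor(images=image, return_tensors="pt").pixel_values.cuda()
        text_inputs = tokenizer(captions, padding=True, return_tensors="pt").to("cuda")
        # Assumed CLIP-style feature extractors exposed by the remote code:
        img_feat = model.get_image_features(pixel_values)
        txt_feat = model.get_text_features(**text_inputs)
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
        # Cosine-similarity scores per caption (no logit scale applied; illustrative).
        print((img_feat @ txt_feat.t()).softmax(dim=-1))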

Highlighted Details

  • Achieves fine-grained visual and textual alignment, excelling in discrimination tasks.
  • Official implementation of the FG-CLIP paper, accepted at ICML'25.
  • Introduces the FineHARD dataset, curated with detailed region captions and challenging negative samples.
  • Models are readily available on Hugging Face for easy integration.
  • FG-CLIP v2 offers API access and improved performance.

Maintenance & Community

The project is actively recruiting academic interns in the multimodal field; interested candidates can contact xiechunyu@360.cn. No community channels (e.g., Discord, Slack) are listed.

Licensing & Compatibility

The core project is licensed under the Apache License 2.0. However, users must comply with the respective original licenses for any datasets and checkpoints utilized. The Apache 2.0 license is generally permissive for commercial use and integration into closed-source projects.

Limitations & Caveats

The training script warns that fp16 precision can produce NaN gradients and recommends DeepSpeed ZeRO-2, tf32, or bf16 instead (see the sketch below). While v1 is open-sourced, v2 is currently available only via API, suggesting potential differences in features or performance. The project's ties to a recently accepted conference paper (ICML'25) indicate a research focus, so production-readiness should be independently assessed.
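Below is a minimal, generic PyTorch sketch of the recommended precision settings (bf16 autocast plus tf32 matmuls). It is not the project's launch script, and the model and data are stand-ins.

    # Avoiding fp16 gradient NaNs: train in bf16 with tf32 matmuls enabled.
    import torch

    torch.backends.cuda.matmul.allow_tf32 = True   # tf32 matmuls on Ampere+ GPUs
    torch.backends.cudnn.allow_tf32 = True

    model = torch.nn.Linear(512, 512).cuda()       # stand-in for the real model
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    x = torch.randn(8, 512, device="cuda")
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = model(x).pow(2).mean()              # bf16 forward; no GradScaler needed
    loss.backward()
    optimizer.step()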

Health Check

  • Last Commit: 12 hours ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 25 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Simon Willison (co-creator of Django), and 10 more.

LAVIS by salesforce

  • Top 0.1% · 11k stars
  • Library for language-vision AI research
  • Created 3 years ago · Updated 11 months ago