Fine-grained text-image alignment model for enhanced discrimination
FG-CLIP is a novel text-image cross-modal model designed for fine-grained discrimination and embedding. It targets researchers and developers who need higher accuracy in tasks requiring precise visual and textual understanding, offering performance superior to standard CLIP models.
How It Works
FG-CLIP employs a two-stage training strategy. The initial stage aligns global image-caption pairs to establish a foundational understanding. The subsequent stage refines this alignment by incorporating detailed region-level captions, including specific descriptions of image areas and distinguishing between positive and negative examples. This approach enables more granular discrimination and richer feature embeddings.
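As a rough illustration of this objective (not the project's actual training code), the following PyTorch sketch applies a standard symmetric contrastive loss once to global image-caption embeddings and once to region-level embeddings; the function names, tensor shapes, and the region_weight coefficient are assumptions made for clarity.

# Minimal sketch, not official FG-CLIP code: global alignment plus
# region-level alignment via a shared symmetric InfoNCE loss.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over matched image/text embedding pairs."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(image_emb.size(0), device=image_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def two_stage_style_loss(global_img, global_txt, region_img, region_txt,
                         region_weight=1.0):
    """Stage-1-style global alignment plus stage-2-style region alignment.

    global_img / global_txt: (B, D) pooled image and caption embeddings.
    region_img / region_txt: (N, D) embeddings of image regions and their
    region-level captions; non-matching rows act as in-batch negatives,
    standing in for the fine-grained negative examples described above.
    """
    loss_global = clip_contrastive_loss(global_img, global_txt)
    loss_region = clip_contrastive_loss(region_img, region_txt)
    return loss_global + region_weight * loss_region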
Quick Start & Requirements
Set up a Python 3.10 environment and install the package in editable mode:
conda create -n FGCLIP python=3.10 -y
conda activate FGCLIP
cd FG-CLIP && pip install -e .
A CUDA-capable GPU is expected, as the example inference code uses .cuda() calls. An API for the v2 model is hosted at research.360.cn.
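After installation, inference follows the usual CLIP pattern: encode an image and a set of candidate texts, then compare their embeddings. The snippet below is a generic sketch built on the Hugging Face transformers CLIP classes as a stand-in; FG-CLIP's own checkpoints and loading entry points may differ, and the model identifier shown is illustrative rather than an official FG-CLIP release. It also assumes a CUDA GPU for the .cuda() calls.

# Generic CLIP-style inference sketch; the checkpoint name is illustrative,
# not an official FG-CLIP identifier.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_id = "openai/clip-vit-base-patch32"  # replace with the FG-CLIP checkpoint you use
model = CLIPModel.from_pretrained(model_id).cuda().eval()
processor = CLIPProcessor.from_pretrained(model_id)

image = Image.open("example.jpg")
texts = ["a red car parked on the street", "a blue car parked on the street"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
inputs = {k: v.cuda() for k, v in inputs.items()}

with torch.no_grad():
    outputs = model(**inputs)
    probs = outputs.logits_per_image.softmax(dim=-1)  # similarity over candidate texts
print(probs)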
Highlighted Details
Maintenance & Community
The project is actively seeking academic interns in the Multimodal field; interested candidates can contact xiechunyu@360.cn. No specific community channels (e.g., Discord, Slack) are listed.
Licensing & Compatibility
The core project is licensed under the Apache License 2.0. However, users must comply with the respective original licenses for any datasets and checkpoints utilized. The Apache 2.0 license is generally permissive for commercial use and integration into closed-source projects.
Limitations & Caveats
The training script notes that fp16 precision may lead to gradient NaNs, recommending zero2, tf32, or bf16 precision. While v1 is open-sourced, v2 is available via API, suggesting potential differences in features or performance. The project's association with an upcoming conference paper indicates a research focus, and production-readiness should be independently assessed.
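For reference, the generic PyTorch switches for tf32 matrix multiplication and bf16 autocast look like the sketch below; this is not taken from the FG-CLIP training script, and the toy model and optimizer are placeholders.

# Minimal sketch: opting into tf32 matmuls and bf16 autocast in PyTorch,
# as alternatives to fp16 when fp16 gradients produce NaNs.
import torch

# Allow TF32 on Ampere or newer GPUs for matmul and cuDNN kernels.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

model = torch.nn.Linear(512, 512).cuda()       # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
x = torch.randn(8, 512, device="cuda")

# Autocast runs selected forward ops in bfloat16 (wider exponent range than
# fp16) while parameters remain in fp32.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = model(x).pow(2).mean()
loss.backward()
optimizer.step()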