Research paper code for extending CLIP's text input length
Top 43.3% on sourcepulse
Long-CLIP enhances CLIP's ability to process extended text inputs, addressing limitations in understanding lengthy descriptions for vision-language tasks. It targets researchers and developers working with long-form text and image data, offering improved performance in retrieval and classification.
How It Works
Long-CLIP extends the standard CLIP text encoder's maximum input length from 77 to 248 tokens. Rather than retraining from scratch, it stretches the pretrained positional embeddings in a knowledge-preserving way (keeping the well-trained leading positions and interpolating the remainder) and fine-tunes the model so short-text performance is retained. The authors report a 20% improvement in R@5 for long-caption text-image retrieval and a 6% improvement in traditional retrieval tasks.
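The positional-embedding stretching can be sketched as follows. This is an illustration under stated assumptions, not the repo's implementation: the choice of 20 preserved leading positions and plain linear interpolation for the rest are assumptions taken from the paper's description, and the embedding values here are random placeholders.

```python
import numpy as np

def stretch_positional_embedding(pe, new_len, keep=20):
    """Knowledge-preserving stretch (sketch): keep the first `keep`
    well-trained positions intact, then linearly interpolate the
    remaining positions to fill the longer context window."""
    kept = pe[:keep]                      # leading positions, copied verbatim
    rest = pe[keep:]                      # positions to be stretched
    # Fractional source indices into `rest` for the new slots
    old_idx = np.linspace(0, rest.shape[0] - 1, new_len - keep)
    lo = np.floor(old_idx).astype(int)
    hi = np.ceil(old_idx).astype(int)
    frac = (old_idx - lo)[:, None]
    stretched = rest[lo] * (1 - frac) + rest[hi] * frac
    return np.concatenate([kept, stretched], axis=0)

# Placeholder for CLIP's learned (77, d_model) positional embedding
pe = np.random.randn(77, 512)
long_pe = stretch_positional_embedding(pe, 248)
print(long_pe.shape)  # → (248, 512)
```

The resulting (248, d) table drops into the text encoder in place of the original (77, d) one; fine-tuning then adapts the model to the longer context.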
Quick Start & Requirements
Download the pretrained checkpoints (longclip-B.pt, longclip-L.pt) and place them in ./checkpoints. Refer to the provided Python snippet for inference; training instructions are in train/train.md.
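Since Long-CLIP follows the upstream CLIP interface, inference presumably looks like the sketch below. The `longclip.load`/`longclip.tokenize` names, the `model` package, and the demo image path are assumptions modeled on OpenAI's CLIP API; check the repo's own README snippet for the exact calls.

```python
import os

CKPT = "./checkpoints/longclip-B.pt"  # assumed checkpoint location

def run_demo(ckpt_path=CKPT):
    # Imports deferred so the sketch loads without the repo installed
    import torch
    from PIL import Image
    from model import longclip  # assumed: module shipped with the Long-CLIP repo

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = longclip.load(ckpt_path, device=device)

    # Long captions (up to 248 tokens) are accepted directly
    text = longclip.tokenize(
        ["A long, detailed caption describing the whole scene ..."]
    ).to(device)
    image = preprocess(Image.open("demo.png")).unsqueeze(0).to(device)

    with torch.no_grad():
        logits_per_image, _ = model(image, text)
        probs = logits_per_image.softmax(dim=-1).cpu().numpy()
    return probs

if os.path.exists(CKPT):  # only run if a checkpoint is actually present
    print(run_demo())
```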
Maintenance & Community
The project is associated with ECCV 2024. The README indicates active development with recent updates and bug fixes. Community channels are not explicitly mentioned.
Licensing & Compatibility
The repository's license is not explicitly stated in the README. Compatibility for commercial use or closed-source linking would require clarification on the licensing terms.
Limitations & Caveats
The README does not specify the license, which may impact commercial adoption. While presented as plug-and-play, specific integration details for various CLIP-based applications beyond SDXL might require further investigation.