Long-CLIP by beichenzbc

Research paper code for extending CLIP's text input length

created 1 year ago
839 stars

Top 43.3% on sourcepulse

Project Summary

Long-CLIP enhances CLIP's ability to process extended text inputs, addressing limitations in understanding lengthy descriptions for vision-language tasks. It targets researchers and developers working with long-form text and image data, offering improved performance in retrieval and classification.

How It Works

Long-CLIP modifies the standard CLIP architecture to accept longer text sequences, raising the maximum input length from 77 to 248 tokens. The authors report a 20% improvement in R@5 (recall within the top five results) on long-caption text-image retrieval and a 6% improvement on traditional retrieval tasks.
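To make the 77-to-248 change concrete, here is a minimal sketch of one way a learned positional-embedding table can be stretched. It assumes the first 20 positions are kept as they are and the remaining 57 are linearly interpolated by a factor of 4, which gives 20 + 57 × 4 = 248. The function name `stretch_positions` and the `keep` and `factor` values are illustrative assumptions that this summary does not state; check the repository for the actual procedure.

```python
def stretch_positions(table, keep=20, factor=4):
    """Stretch a positional-embedding table along the position axis.

    table  : list of rows, one list of floats per position
    keep   : leading positions copied unchanged (ASSUMPTION, not from the summary)
    factor : interpolation ratio for the remaining positions (ASSUMPTION)
    """
    head = [row[:] for row in table[:keep]]
    tail = table[keep:]
    stretched = []
    for i in range(len(tail) * factor):
        src = i / factor                  # fractional index into the old tail
        lo = int(src)
        hi = min(lo + 1, len(tail) - 1)   # clamp at the last original row
        w = src - lo
        # Linear blend of the two neighbouring original rows
        stretched.append([(1 - w) * a + w * b for a, b in zip(tail[lo], tail[hi])])
    return head + stretched


# Toy table: 77 positions, embedding width 1, value equal to the position index
toy = [[float(p)] for p in range(77)]
longer = stretch_positions(toy)
print(len(longer))
```

Leaving the earliest positions untouched in this sketch means short prompts would see the same embeddings as before, while only the later positions are resampled.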

Quick Start & Requirements

  • Install: Clone the repository and install CLIP dependencies.
  • Prerequisites: PyTorch, Pillow, and CLIP. CUDA is recommended for GPU acceleration.
  • Usage: Download the checkpoints (longclip-B.pt, longclip-L.pt) and place them in ./checkpoints. The repository README provides a Python snippet for inference.
  • Evaluation: Scripts are available for zero-shot classification (ImageNet, CIFAR) and text-image retrieval (COCO2017, Flickr30k).
  • Training: Details are in train/train.md.
  • Demos: Available for Long-CLIP-SDXL integration and long-caption retrieval.
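As a rough illustration of the Usage step, the sketch below scores one image against several captions. The checkpoint names and the ./checkpoints directory come from the list above. Everything else is an assumption to verify against the README: that the repository exposes a `model.longclip` module, and that its `load` and `tokenize` helpers mirror the original CLIP package. `checkpoint_path` and `score_captions` are hypothetical helper names.

```python
from pathlib import Path

# Directory and file names come from the Usage bullet above.
CHECKPOINT_DIR = Path("./checkpoints")


def checkpoint_path(variant: str) -> Path:
    """Return the expected checkpoint file for variant 'B' or 'L'."""
    if variant not in ("B", "L"):
        raise ValueError(f"unknown variant {variant!r}; expected 'B' or 'L'")
    return CHECKPOINT_DIR / f"longclip-{variant}.pt"


def score_captions(image_file: str, captions: list[str], variant: str = "B") -> list[float]:
    """Cosine similarity between one image and each caption."""
    # Imports are deferred so this file loads without the repository on the path.
    import torch
    from PIL import Image
    from model import longclip  # ASSUMPTION: module path inside the cloned repo

    device = "cuda" if torch.cuda.is_available() else "cpu"
    # ASSUMPTION: load() and tokenize() behave like the original CLIP helpers.
    model, preprocess = longclip.load(str(checkpoint_path(variant)), device=device)
    tokens = longclip.tokenize(captions).to(device)
    image = preprocess(Image.open(image_file)).unsqueeze(0).to(device)

    with torch.no_grad():
        image_features = model.encode_image(image)
        text_features = model.encode_text(tokens)

    # Normalize so the dot product is a cosine similarity.
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    return (image_features @ text_features.T).squeeze(0).tolist()
```

Run from the repository root, a call such as `score_captions("photo.jpg", ["a long caption ...", "another caption ..."])` would return one similarity score per caption. The image path here is a placeholder.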

Highlighted Details

  • Increases CLIP's maximum input length from 77 to 248 tokens.
  • Achieves 20% R@5 improvement on long-caption retrieval and 6% on traditional retrieval.
  • Fine-tuning takes approximately 0.5 hours on 8 GPUs.
  • Offers plug-and-play integration with existing CLIP-based workflows, including SDXL.
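For readers unfamiliar with the R@5 figure, the following is a minimal sketch of how Recall@K is commonly computed for retrieval. It assumes each query has exactly one correct item, which is a simplification; datasets such as COCO pair several captions with each image. `recall_at_k` is an illustrative name and is not taken from the repository's evaluation scripts.

```python
def recall_at_k(similarity, ground_truth, k=5):
    """Fraction of queries whose correct item ranks within the top k.

    similarity   : one row per query, holding a score for each candidate
    ground_truth : index of the correct candidate for each query
    """
    hits = 0
    for query_idx, scores in enumerate(similarity):
        # Candidate indices sorted by descending score, truncated to k
        top_k = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]
        if ground_truth[query_idx] in top_k:
            hits += 1
    return hits / len(similarity)


# Three text queries scored against three images
scores = [
    [0.9, 0.1, 0.2],
    [0.3, 0.2, 0.8],
    [0.1, 0.7, 0.4],
]
correct = [0, 1, 2]
print(recall_at_k(scores, correct, k=1))
print(recall_at_k(scores, correct, k=2))
```

In this toy example only the first query ranks its correct image first, and two of the three queries place it within the top two.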

Maintenance & Community

The project is associated with ECCV 2024. The README indicates active development with recent updates and bug fixes. Community channels are not explicitly mentioned.

Licensing & Compatibility

The repository's license is not explicitly stated in the README. Compatibility for commercial use or closed-source linking would require clarification on the licensing terms.

Limitations & Caveats

The unstated license may hinder commercial adoption. Although Long-CLIP is presented as plug-and-play, integration with CLIP-based applications other than SDXL is not described in detail and may require further investigation.

Health Check

  • Last commit: 11 months ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 1
  • Star history: 40 stars in the last 90 days
