Long-CLIP by beichenzbc

Research paper code for extending CLIP's text input length

Created 1 year ago
850 stars

Top 42.0% on SourcePulse

View on GitHub
Project Summary

Long-CLIP enhances CLIP's ability to process extended text inputs, addressing limitations in understanding lengthy descriptions for vision-language tasks. It targets researchers and developers working with long-form text and image data, offering improved performance in retrieval and classification.

How It Works

Long-CLIP extends the standard CLIP text encoder to accept longer sequences, raising the maximum input length from 77 to 248 tokens. Rather than retraining from scratch, the method stretches CLIP's learned positional embeddings, preserving the well-trained early positions and interpolating the rest, then fine-tunes on long captions. The authors report a 20% improvement in R@5 for long-caption text-image retrieval and a 6% improvement on traditional retrieval tasks.
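As an illustration of the positional-embedding stretching idea, here is a minimal sketch, assuming a learned (77, d) CLIP position table and hypothetical `keep`/`ratio` values of 20 and 4 (which give 20 + 4 × 57 = 248 positions); this is not the repository's actual implementation:

```python
import torch
import torch.nn.functional as F

def stretch_positional_embedding(pos_embed: torch.Tensor,
                                 keep: int = 20,
                                 ratio: int = 4) -> torch.Tensor:
    """Stretch a (77, d) position table to (keep + ratio * (77 - keep), d).

    The first `keep` positions (well trained, since most captions are short)
    are copied unchanged; the remaining positions are linearly interpolated
    along the sequence axis.
    """
    head = pos_embed[:keep]                       # (keep, d), kept as-is
    tail = pos_embed[keep:]                       # (77 - keep, d)
    tail = tail.T.unsqueeze(0)                    # (1, d, 77 - keep) for interpolate
    tail = F.interpolate(tail, size=ratio * tail.shape[-1],
                         mode="linear", align_corners=True)
    tail = tail.squeeze(0).T                      # (ratio * (77 - keep), d)
    return torch.cat([head, tail], dim=0)         # (248, d) with the defaults

# toy table: 77 positions, embedding dim 512 -> 248 positions
new_table = stretch_positional_embedding(torch.randn(77, 512))
print(new_table.shape)  # torch.Size([248, 512])
```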

Quick Start & Requirements

  • Install: Clone the repository and install CLIP dependencies.
  • Prerequisites: PyTorch, Pillow, and CLIP. CUDA is recommended for GPU acceleration.
  • Usage: Download checkpoints (longclip-B.pt, longclip-L.pt) and place them in ./checkpoints. Refer to the repository's Python snippet for inference; a hedged sketch follows this list.
  • Evaluation: Scripts are available for zero-shot classification (ImageNet, CIFAR) and text-image retrieval (COCO2017, Flickr30k).
  • Training: Details are in train/train.md.
  • Demos: Available for Long-CLIP-SDXL integration and long-caption retrieval.
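For reference, a minimal inference sketch in the spirit of the README's snippet, assuming the repository's `longclip` module mirrors OpenAI's `clip` API (`load`, `tokenize`, `encode_image`/`encode_text`) and that a local `demo.png` exists; check the README for the authoritative version:

```python
import torch
from PIL import Image
from model import longclip  # assumed import path within the cloned repo

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load a checkpoint downloaded to ./checkpoints (see Quick Start above)
model, preprocess = longclip.load("./checkpoints/longclip-B.pt", device=device)

# Unlike vanilla CLIP (77 tokens), captions may run up to 248 tokens
text = longclip.tokenize([
    "A man is crossing the street with a red car parked nearby.",
    "A man is driving a car in an urban scene.",
]).to(device)
image = preprocess(Image.open("demo.png")).unsqueeze(0).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print("Label probs:", probs)  # similarity of the image to each caption
```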

Highlighted Details

  • Increases CLIP's maximum input length from 77 to 248 tokens.
  • Achieves 20% R@5 improvement on long-caption retrieval and 6% on traditional retrieval.
  • Fine-tuning takes approximately 0.5 hours on 8 GPUs.
  • Offers plug-and-play integration with existing CLIP-based workflows, including SDXL.
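On the metric in the first two bullets: R@5 (recall at 5) counts a query as correct when its ground-truth match ranks among the top five candidates by similarity. A minimal sketch of computing R@k from a similarity matrix, using toy tensors rather than the repository's evaluation scripts:

```python
import torch

def recall_at_k(sim: torch.Tensor, k: int = 5) -> float:
    """sim[i, j] = similarity of query i to candidate j; matches lie on the diagonal."""
    topk = sim.topk(k, dim=1).indices                 # indices of the k best candidates
    targets = torch.arange(sim.size(0)).unsqueeze(1)  # ground-truth index per query
    return (topk == targets).any(dim=1).float().mean().item()

# toy example: 8 queries vs. 8 candidates, e.g. text_features @ image_features.T
sim = torch.randn(8, 8)
print(f"R@5: {recall_at_k(sim, k=5):.2f}")
```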

Maintenance & Community

The paper was accepted at ECCV 2024. The README lists updates and bug fixes, though the last commit was about a year ago (see Health Check below). No community channels are explicitly mentioned.

Licensing & Compatibility

The repository's license is not explicitly stated in the README. Commercial use or closed-source linking would therefore require clarifying the licensing terms with the authors.

Limitations & Caveats

The README does not specify a license, which may impede commercial adoption. While the models are presented as plug-and-play, integrating them into CLIP-based applications beyond SDXL may require further investigation.

Health Check
  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 4 stars in the last 30 days

Explore Similar Projects

Starred by Jiayi Pan (Author of SWE-Gym; MTS at xAI), Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), and 1 more.

METER by zdou0830

373 stars · Multimodal framework for vision-and-language transformer research
Created 3 years ago · Updated 2 years ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Yaowei Zheng (Author of LLaMA-Factory), and 1 more.

CLIP_prefix_caption by rmokady

1k stars · Image captioning model using CLIP embeddings as a prefix
Created 4 years ago · Updated 1 year ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Pawel Garbacki (Cofounder of Fireworks AI), and 4 more.

LongLoRA by dvlab-research

3k stars · Efficient fine-tuning for long-context LLMs
Created 2 years ago · Updated 1 year ago
Starred by Matei Zaharia (Cofounder of Databricks), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 9 more.

LWM by LargeWorldModel

7k stars · Multimodal autoregressive model for long-context video/text
Created 1 year ago · Updated 11 months ago