Cap4Video by whwu95

Text-video retrieval enhanced by LLM-generated captions

Created 3 years ago
251 stars

Top 99.8% on SourcePulse

Project Summary

Cap4Video is a framework for improving text-video retrieval by exploiting auxiliary captions generated by large language models (LLMs). Aimed at researchers and engineers working on computer vision and cross-modal understanding, it builds on existing vision-language models to raise video-text matching accuracy.

How It Works

Cap4Video enhances text-video retrieval by integrating auxiliary captions generated by large language models. Its approach has three stages: augmenting training data with the synthetic captions, letting video and caption features interact at an intermediate stage to produce more compact video representations, and fusing the output matching scores to boost overall text-video matching accuracy. Together these stages aim to extract as much semantic signal as possible from the auxiliary captions, benefiting both global and fine-grained matching.
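
To make the output score fusion stage concrete, the sketch below (a minimal PyTorch illustration, not the repository's actual implementation) fuses query-video and query-caption similarities into a single retrieval score. The embeddings are random stand-ins for CLIP features, and the fusion weight `alpha` is a hypothetical hyperparameter.

```python
import torch
import torch.nn.functional as F

def fused_retrieval_scores(text_emb, video_emb, caption_emb, alpha=0.8):
    """Fuse query-video and query-caption similarities (illustrative only).

    text_emb:    (num_queries, dim) query/text embeddings
    video_emb:   (num_videos, dim)  video embeddings (e.g., mean-pooled frame features)
    caption_emb: (num_videos, dim)  embeddings of LLM-generated auxiliary captions
    alpha:       weight on the video branch (hypothetical hyperparameter)
    """
    # Cosine similarity via dot products of L2-normalized embeddings.
    text_emb = F.normalize(text_emb, dim=-1)
    video_emb = F.normalize(video_emb, dim=-1)
    caption_emb = F.normalize(caption_emb, dim=-1)

    sim_query_video = text_emb @ video_emb.t()      # (num_queries, num_videos)
    sim_query_caption = text_emb @ caption_emb.t()  # (num_queries, num_videos)

    # Output score fusion: the caption branch complements the video branch.
    return alpha * sim_query_video + (1 - alpha) * sim_query_caption

# Toy example with random tensors standing in for CLIP features.
queries = torch.randn(4, 512)
videos = torch.randn(10, 512)
captions = torch.randn(10, 512)
scores = fused_retrieval_scores(queries, videos, captions)
ranking = scores.argsort(dim=-1, descending=True)  # per-query video ranking
print(ranking[:, :3])
```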

Quick Start & Requirements

  • Installation: Requires PyTorch 1.8.1 with CUDA 11.1, along with ftfy, regex, tqdm, opencv-python, boto3, and pandas.
  • Prerequisites: Pre-trained CLIP B/32 and CLIP B/16 models, video datasets preprocessed into frames, and the provided caption files (see the sketch after this list).
  • Resources: Links to the CVPR'2023 paper and arXiv preprint are provided.
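
As a rough sketch of how these prerequisites fit together, the snippet below encodes sampled frames and an auxiliary caption with a pre-trained CLIP ViT-B/32 backbone. It uses the stand-alone openai `clip` package and hypothetical file paths for illustration; the repository likely has its own CLIP loading code and data pipeline, so treat this as orientation rather than the project's API.

```python
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the pre-trained CLIP ViT-B/32 backbone (ViT-B/16 loads the same way).
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical paths: a video pre-extracted into frames, plus an LLM-generated caption.
frame_paths = ["video0001/frame_000.jpg", "video0001/frame_008.jpg", "video0001/frame_016.jpg"]
auxiliary_caption = "A man demonstrates how to repair a bicycle tire in a garage."
query = "someone fixing a bike"

with torch.no_grad():
    # Encode sampled frames and mean-pool into a single video embedding.
    frames = torch.stack([preprocess(Image.open(p)) for p in frame_paths]).to(device)
    video_emb = model.encode_image(frames).mean(dim=0, keepdim=True)

    # Encode the auxiliary caption and the text query.
    caption_emb = model.encode_text(clip.tokenize([auxiliary_caption]).to(device))
    query_emb = model.encode_text(clip.tokenize([query]).to(device))

# These embeddings feed the interaction / score-fusion stages sketched above.
print(video_emb.shape, caption_emb.shape, query_emb.shape)
```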

Highlighted Details

  • Recognized as a CVPR'2023 Highlight paper.
  • Extended version accepted for publication in TPAMI (IEEE Transactions on Pattern Analysis and Machine Intelligence).
  • Demonstrates improved performance in text-video retrieval tasks, compatible with both global and fine-grained matching scenarios.
  • Leverages pre-trained vision-language models (like CLIP) and LLM-generated captions for enhanced cross-modal understanding.

Maintenance & Community

The extended version of the work was accepted by TPAMI in May 2024; the code itself was released in April 2023, and commit activity has since slowed (see Health Check below). For questions or issues, users are directed to file an issue on the repository. No community channels such as Discord or Slack are listed.

Licensing & Compatibility

The repository's README does not explicitly state a software license. Users should exercise caution regarding usage, particularly for commercial applications, until a license is clarified.

Limitations & Caveats

No specific limitations or known bugs are detailed in the provided README. The installation requirements specify older versions of PyTorch and CUDA, which may pose challenges for integration into modern development environments.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 0 stars in the last 30 days

Explore Similar Projects

Starred by Stas Bekman (author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), Douwe Kiela (cofounder of Contextual AI), and 1 more.

  • lens by ContextualAI — Vision-language research paper using LLMs. 356 stars. Created 2 years ago; updated 5 months ago.