Video embeddings for retrieval with natural language queries
This repository provides code for learning and evaluating joint video-text embeddings for video retrieval. It introduces "Collaborative Experts" (CE), a framework that leverages multiple modalities (RGB, audio, OCR, speech, etc.) and a distillation setup called "TeachText" to improve retrieval performance. The target audience includes researchers and engineers working on multimodal understanding and video retrieval tasks.
How It Works
The Collaborative Experts (CE) framework combines features from various "expert" models (e.g., RGB, audio, OCR) into a unified representation. Robustness comes from drawing on a wide range of modalities and from a gating module that combines them into a single fixed-size embedding. TeachText strengthens the training signal for the retrieval model by using complementary cues from multiple text encoders in a generalized distillation setup.
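As a rough sketch of the idea (a hypothetical simplification, not the repository's actual implementation; all names and dimensions below are illustrative), the following PyTorch snippet projects each expert's features into a shared space, weights them with learned gates, and concatenates them into one fixed-size video embedding, with a schematic TeachText-style loss that pulls a student's video-text similarity matrix toward a teacher's:

import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedExpertFusion(nn.Module):
    # Toy sketch in the spirit of Collaborative Experts: per-expert
    # projections plus learned scalar gates, concatenated and normalized.
    def __init__(self, expert_dims, shared_dim=512):
        super().__init__()
        self.projections = nn.ModuleDict(
            {name: nn.Linear(dim, shared_dim) for name, dim in expert_dims.items()})
        self.gates = nn.ModuleDict(
            {name: nn.Linear(shared_dim, 1) for name in expert_dims})

    def forward(self, features):
        # features: dict mapping expert name -> (batch, dim) tensor
        fused = []
        for name, proj in self.projections.items():
            h = F.relu(proj(features[name]))           # project into shared space
            gate = torch.sigmoid(self.gates[name](h))  # per-sample scalar gate
            fused.append(gate * h)
        return F.normalize(torch.cat(fused, dim=-1), dim=-1)

def teachtext_style_loss(student_sims, teacher_sims):
    # Schematic generalized distillation: match the student's video-text
    # similarity matrix to the (frozen) teacher's.
    return F.mse_loss(student_sims, teacher_sims.detach())

# Illustrative usage with made-up expert dimensions:
experts = {"rgb": 2048, "audio": 128, "ocr": 300}
model = GatedExpertFusion(experts)
feats = {name: torch.randn(4, dim) for name, dim in experts.items()}
video_emb = model(feats)  # shape: (4, 512 * len(experts))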
Quick Start & Requirements
pip install -r requirements/pip-requirements.txt
python3 train.py --config <config.json> --device <gpu-id>
python3 test.py --config <config.json> --resume <model.pth> --device <gpu-id>
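As a concrete illustration, assuming an MSR-VTT config and a downloaded checkpoint (the paths below are hypothetical; check the repository's configs and pretrained-model listings for the real names):

python3 train.py --config configs/msrvtt/train-full-ce.json --device 0
python3 test.py --config configs/msrvtt/train-full-ce.json --resume pretrained/msrvtt-ce.pth --device 0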
Highlighted Details
Maintenance & Community
The project is associated with the University of Oxford's Visual Geometry Group (VGG). Key contributors include Y. Liu, S. Albanie, A. Nagrani, and A. Zisserman. The work is inspired by Miech et al.'s Mixture-of-Embedding-Experts.
Licensing & Compatibility
The repository does not explicitly state a license. The provided code and pre-trained models are likely intended for research purposes. Commercial use would require careful review of any associated licenses or permissions.
Limitations & Caveats