collaborative-experts by albanie

Video embeddings for retrieval with natural language queries

created 6 years ago
342 stars

Top 81.9% on sourcepulse

Project Summary

This repository provides code for learning and evaluating joint video-text embeddings for video retrieval. It introduces "Collaborative Experts" (CE), a framework that leverages multiple modalities (RGB, audio, OCR, speech, etc.) and a distillation setup called "TeachText" to improve retrieval performance. The target audience includes researchers and engineers working on multimodal understanding and video retrieval tasks.
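In a joint embedding space, retrieval reduces to ranking videos by cosine similarity to the query's text embedding. A minimal hand-rolled sketch of that idea with a toy recall@k metric (illustrative only, not the repository's actual evaluation code):

```python
import numpy as np

def recall_at_k(text_emb, video_emb, k=1):
    """Given L2-normalised text and video embeddings with matching rows
    (text i describes video i), rank videos for each query by cosine
    similarity and report the fraction whose match lands in the top k."""
    sims = text_emb @ video_emb.T            # cosine similarity matrix
    ranks = np.argsort(-sims, axis=1)        # best-first ranking per query
    hits = [i in ranks[i, :k] for i in range(len(text_emb))]
    return float(np.mean(hits))

# Toy joint space: 3 text/video pairs embedded in 2-D
t = np.array([[1.0, 0.0], [0.0, 1.0], [0.7071, 0.7071]])
v = np.array([[0.9, 0.1], [0.1, 0.9], [0.6, 0.6]])
v = v / np.linalg.norm(v, axis=1, keepdims=True)
```

The real pipeline evaluates exactly this kind of text-to-video ranking, but over learned embeddings on benchmarks such as MSRVTT.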

How It Works

The Collaborative Experts (CE) framework combines features from multiple pre-trained "expert" models (e.g., RGB, audio, OCR, speech) into a unified video representation. Robustness comes from drawing on a wide range of modalities and from a module that aggregates them into a single fixed-size embedding. TeachText strengthens the training signal by distilling complementary cues from multiple text encoders in a generalized distillation setup.
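A minimal sketch of the expert-combination idea: gate each expert's contribution, then pool into one fixed-size vector. Function names, shapes, and the softmax gating are illustrative assumptions, not the repository's actual API:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def fuse_experts(expert_feats, gate_scores):
    """Combine per-expert embeddings (assumed already projected to a
    common dimension) into one fixed-size video representation,
    weighting each expert by a softmax-normalised gate score."""
    names = sorted(expert_feats)
    weights = softmax(np.array([gate_scores[n] for n in names]))
    fused = sum(w * expert_feats[n] for w, n in zip(weights, names))
    return fused / np.linalg.norm(fused)  # unit norm for cosine retrieval

# Toy example: three experts with 4-dim embeddings
feats = {"rgb": np.ones(4), "audio": np.zeros(4), "ocr": 2 * np.ones(4)}
fused = fuse_experts(feats, {"rgb": 1.0, "audio": 0.0, "ocr": 1.0})
```

In the actual model, the gating weights are learned and can depend on which experts are present for a given video, which is what makes the combination robust to missing modalities.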

Quick Start & Requirements

  • Install: pip install -r requirements/pip-requirements.txt
  • Prerequisites: Python 3.7 and PyTorch 1.4. Pre-trained expert features must be downloaded per dataset (e.g., MSRVTT: 19.6 GiB).
  • Evaluation: Download pre-trained models and run python3 test.py --config <config.json> --resume <model.pth> --device <gpu-id>.
  • Training: Download pre-trained experts and run python3 train.py --config <config.json> --device <gpu-id>.
  • Links: Project Page, TeachText Paper, CVPR 2020 Pentathlon Challenge

Highlighted Details

  • Achieves state-of-the-art results on multiple video retrieval benchmarks (MSRVTT, MSVD, DiDeMo, ActivityNet, LSMDC).
  • Detailed ablation studies demonstrate the importance of different experts and model components.
  • Provides pre-trained models and expert features for several datasets.
  • Includes a visualization tool for retrieval rankings.

Maintenance & Community

The project is associated with the University of Oxford's Visual Geometry Group (VGG). Key contributors include Y. Liu, S. Albanie, A. Nagrani, and A. Zisserman. The work is inspired by Miech et al.'s Mixture-of-Embedding-Experts.

Licensing & Compatibility

The repository does not explicitly state a license. The code and pre-trained models appear intended for research use; commercial use would require reviewing any associated licenses or obtaining permission from the authors.

Limitations & Caveats

  • The code is tested with PyTorch 1.4 and Python 3.7; compatibility with newer versions is not guaranteed.
  • Requires significant disk space for pre-trained expert features (up to 19.6 GiB per dataset).
  • A previous version of the codebase contained a bug that overestimated performance; the current version has been corrected.
  • Access to LSMDC dataset features requires explicit permission from MPII and confirmation from the LSMDC team.
Health Check

  • Last commit: 2 years ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 2 stars in the last 90 days
