Video embeddings for retrieval with natural language queries
This repository provides code for learning and evaluating joint video-text embeddings for video retrieval. It introduces "Collaborative Experts" (CE), a framework that leverages multiple modalities (RGB, audio, OCR, speech, etc.) and a distillation setup called "TeachText" to improve retrieval performance. The target audience includes researchers and engineers working on multimodal understanding and video retrieval tasks.
How It Works
The Collaborative Experts (CE) framework combines features from various "expert" models (e.g., RGB, audio, OCR) into a unified representation. Robustness comes from drawing on a wide range of modalities and from a gating module that combines them into a single fixed-size embedding. TeachText strengthens the training signal for the retrieval model by using complementary cues from multiple text encoders in a generalized distillation setup.
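As a rough sketch of the idea (a hypothetical simplification, not the repository's actual implementation; all names and dimensions below are illustrative), the following PyTorch snippet projects each expert's features into a shared space, weights them with learned gates, and concatenates them into one fixed-size video embedding, with a schematic TeachText-style loss that pulls a student's video-text similarity matrix toward a teacher's:

import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedExpertFusion(nn.Module):
    # Toy sketch in the spirit of Collaborative Experts: per-expert
    # projections plus learned scalar gates, concatenated and normalized.
    def __init__(self, expert_dims, shared_dim=512):
        super().__init__()
        self.projections = nn.ModuleDict(
            {name: nn.Linear(dim, shared_dim) for name, dim in expert_dims.items()})
        self.gates = nn.ModuleDict(
            {name: nn.Linear(shared_dim, 1) for name in expert_dims})

    def forward(self, features):
        # features: dict mapping expert name -> (batch, dim) tensor
        fused = []
        for name, proj in self.projections.items():
            h = F.relu(proj(features[name]))           # project into shared space
            gate = torch.sigmoid(self.gates[name](h))  # per-sample scalar gate
            fused.append(gate * h)
        return F.normalize(torch.cat(fused, dim=-1), dim=-1)

def teachtext_style_loss(student_sims, teacher_sims):
    # Schematic generalized distillation: match the student's video-text
    # similarity matrix to the (frozen) teacher's.
    return F.mse_loss(student_sims, teacher_sims.detach())

# Illustrative usage with made-up expert dimensions:
experts = {"rgb": 2048, "audio": 128, "ocr": 300}
model = GatedExpertFusion(experts)
feats = {name: torch.randn(4, dim) for name, dim in experts.items()}
video_emb = model(feats)  # shape: (4, 512 * len(experts))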
Quick Start & Requirements
pip install -r requirements/pip-requirements.txt
python3 train.py --config <config.json> --device <gpu-id>
python3 test.py --config <config.json> --resume <model.pth> --device <gpu-id>
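As a concrete illustration, assuming an MSR-VTT config and a downloaded checkpoint (the paths below are hypothetical; check the repository's configs and pretrained-model listings for the real names):

python3 train.py --config configs/msrvtt/train-full-ce.json --device 0
python3 test.py --config configs/msrvtt/train-full-ce.json --resume pretrained/msrvtt-ce.pth --device 0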
Highlighted Details
Maintenance & Community
The project is associated with the University of Oxford's Visual Geometry Group (VGG). Key contributors include Y. Liu, S. Albanie, A. Nagrani, and A. Zisserman. The work is inspired by Miech et al.'s Mixture-of-Embedding-Experts.
Licensing & Compatibility
The repository does not explicitly state a license. The provided code and pre-trained models are likely intended for research purposes. Commercial use would require careful review of any associated licenses or permissions.
Limitations & Caveats