ml-slowfast-llava by apple

Video understanding and reasoning with a training-free LLM

Created 1 year ago
266 stars

Top 96.2% on SourcePulse

Project Summary

SlowFast-LLaVA addresses video understanding and reasoning with a training-free approach built on large language models (LLMs). It is designed for researchers and practitioners who need a strong baseline for video question-answering (VideoQA) without any fine-tuning, while offering performance comparable or superior to existing state-of-the-art methods.

How It Works

SlowFast-LLaVA uses a dual-pathway design inspired by the SlowFast network: a slow pathway samples few frames at full spatial resolution to capture detailed spatial semantics, while a fast pathway samples many frames and aggressively pools them spatially to capture temporal context and motion cues. Crucially, these visual features are fed to the LLM in a training-free manner, enabling direct inference and evaluation on various VideoQA benchmarks. Bypassing the computationally intensive fine-tuning stage makes it a highly efficient baseline.
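The sketch below illustrates this two-stream token aggregation under stated assumptions; the function name, frame counts, and pooling sizes are illustrative only and are not the repository's actual API or hyperparameters.

```python
# Minimal sketch of SlowFast-style visual token aggregation (illustrative only;
# names, frame counts, and pooling sizes are assumptions, not the repo's API).
import torch
import torch.nn.functional as F

def slowfast_tokens(frame_features: torch.Tensor) -> torch.Tensor:
    """frame_features: (T, H, W, C) per-frame patch features from a frozen image encoder."""
    T, H, W, C = frame_features.shape

    # Slow pathway: sparsely sample frames, keep full spatial resolution (spatial detail).
    slow = frame_features[:: max(T // 8, 1)]                 # roughly 8 frames
    slow_tokens = slow.reshape(-1, C)

    # Fast pathway: use all frames, aggressively pool spatially (temporal / motion cues).
    fast = frame_features.permute(0, 3, 1, 2)                # (T, C, H, W)
    fast = F.adaptive_avg_pool2d(fast, output_size=(2, 2))   # (T, C, 2, 2)
    fast_tokens = fast.permute(0, 2, 3, 1).reshape(-1, C)

    # Concatenate both token streams; these become the visual context for the frozen LLM.
    return torch.cat([slow_tokens, fast_tokens], dim=0)

# Example: 32 frames of 24x24 patch features with hidden size 1024.
tokens = slowfast_tokens(torch.randn(32, 24, 24, 1024))
print(tokens.shape)  # torch.Size([4736, 1024]) -> 8*576 slow tokens + 32*4 fast tokens
```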

Quick Start & Requirements

  • Installation: Requires CUDA 11.7, Python >= 3.10.12, and PyTorch >= 2.1.0. A conda environment is recommended (conda create -n sf_llava python=3.10.12). Install dependencies using bash setup_env.sh.
  • Prerequisites: OpenAI API key and organization ID are needed for GPT-3.5-turbo evaluation. Pre-trained LLaVA-NeXT weights (7B and 34B) must be downloaded from HuggingFace and placed in the ml-slowfast-llava directory.
  • Data Preparation: Specific scripts are provided to reformat various VideoQA datasets (MSVD-QA, MSRVTT-QA, TGIF-QA, Activitynet-QA, NExT-QA, EgoSchema, IntentQA, VCGBench). Raw videos need to be downloaded and organized according to the specified directory structure.
  • Inference: Run inference using python run_inference.py --exp_config $PATH_TO_CONFIG_FILE; the 34B model requires GPUs with at least 80 GB of memory. A consolidated sketch of this workflow follows this list.
  • Demo: A script run_demo.py is available for single-video demonstrations.
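
A minimal sketch of the documented workflow, driving the published entry points (run_inference.py, run_demo.py) from Python; only those script names and the --exp_config flag come from the README, and the config path below is a placeholder rather than a file guaranteed to ship with the repository.

```python
# Drives the documented entry points; run_inference.py, run_demo.py, and --exp_config
# come from the README. The config path and repo directory below are placeholders.
import subprocess
from pathlib import Path

REPO_DIR = Path("ml-slowfast-llava")  # clone root, with the LLaVA-NeXT weights placed inside

def run_benchmark(exp_config: str) -> None:
    """Evaluate on a VideoQA benchmark via the documented inference script."""
    subprocess.run(
        ["python", "run_inference.py", "--exp_config", exp_config],
        cwd=REPO_DIR,
        check=True,
    )

def run_single_video_demo() -> None:
    """Single-video demonstration described in the README."""
    subprocess.run(["python", "run_demo.py"], cwd=REPO_DIR, check=True)

if __name__ == "__main__":
    run_benchmark("configs/example_config.yaml")  # placeholder path
```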

Highlighted Details

  • Achieves comparable or better performance than state-of-the-art Video LLMs on various VideoQA tasks without fine-tuning.
  • Employs a training-free multimodal LLM approach for video understanding and reasoning.
  • Utilizes a dual-pathway (SlowFast) visual feature extraction.

Maintenance & Community

The project is associated with Apple and the research paper "SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models" by Xu et al. (2024). Further community or maintenance details are not specified in the README.

Licensing & Compatibility

  • License: Apple Sample Code License.
  • Compatibility: Specific compatibility notes for commercial use or closed-source linking are not detailed.

Limitations & Caveats

The README does not document known limitations or an alpha status. Practical caveats: the 34B model requires GPUs with at least 80 GB of memory, and GPT-3.5-turbo-based evaluation depends on an OpenAI API key.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 30 stars in the last 30 days
