ml-slowfast-llava by apple

Video understanding and reasoning with a training-free LLM

Created 1 year ago
266 stars

Top 96.2% on SourcePulse

Project Summary

SlowFast-LLaVA addresses video understanding and reasoning with a training-free approach built on large language models (LLMs). It is designed for researchers and practitioners who need a strong baseline for video question-answering (VideoQA) without any fine-tuning, while offering performance comparable or superior to existing state-of-the-art methods.

How It Works

SlowFast-LLaVA uses a dual-pathway design inspired by the SlowFast network: a slow pathway samples few frames at full spatial resolution to capture detailed spatial semantics, while a fast pathway samples many frames and aggressively pools them spatially to capture temporal context and motion cues. Crucially, these visual features are fed to the LLM in a training-free manner, enabling direct inference and evaluation on various VideoQA benchmarks. Bypassing the computationally intensive fine-tuning stage makes it a highly efficient baseline.
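The sketch below illustrates this two-stream token aggregation under stated assumptions; the function name, frame counts, and pooling sizes are illustrative only and are not the repository's actual API or hyperparameters.

```python
# Minimal sketch of SlowFast-style visual token aggregation (illustrative only;
# names, frame counts, and pooling sizes are assumptions, not the repo's API).
import torch
import torch.nn.functional as F

def slowfast_tokens(frame_features: torch.Tensor) -> torch.Tensor:
    """frame_features: (T, H, W, C) per-frame patch features from a frozen image encoder."""
    T, H, W, C = frame_features.shape

    # Slow pathway: sparsely sample frames, keep full spatial resolution (spatial detail).
    slow = frame_features[:: max(T // 8, 1)]                 # roughly 8 frames
    slow_tokens = slow.reshape(-1, C)

    # Fast pathway: use all frames, aggressively pool spatially (temporal / motion cues).
    fast = frame_features.permute(0, 3, 1, 2)                # (T, C, H, W)
    fast = F.adaptive_avg_pool2d(fast, output_size=(2, 2))   # (T, C, 2, 2)
    fast_tokens = fast.permute(0, 2, 3, 1).reshape(-1, C)

    # Concatenate both token streams; these become the visual context for the frozen LLM.
    return torch.cat([slow_tokens, fast_tokens], dim=0)

# Example: 32 frames of 24x24 patch features with hidden size 1024.
tokens = slowfast_tokens(torch.randn(32, 24, 24, 1024))
print(tokens.shape)  # torch.Size([4736, 1024]) -> 8*576 slow tokens + 32*4 fast tokens
```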

Quick Start & Requirements

  • Installation: Requires CUDA 11.7, Python >= 3.10.12, and PyTorch >= 2.1.0. A conda environment is recommended (conda create -n sf_llava python=3.10.12). Install dependencies using bash setup_env.sh.
  • Prerequisites: OpenAI API key and organization ID are needed for GPT-3.5-turbo evaluation. Pre-trained LLaVA-NeXT weights (7B and 34B) must be downloaded from HuggingFace and placed in the ml-slowfast-llava directory.
  • Data Preparation: Specific scripts are provided to reformat various VideoQA datasets (MSVD-QA, MSRVTT-QA, TGIF-QA, Activitynet-QA, NExT-QA, EgoSchema, IntentQA, VCGBench). Raw videos need to be downloaded and organized according to the specified directory structure.
  • Inference: Run inference using python run_inference.py --exp_config $PATH_TO_CONFIG_FILE; the 34B model requires GPUs with at least 80 GB of memory. A consolidated sketch of this workflow follows this list.
  • Demo: A script run_demo.py is available for single-video demonstrations.
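
A minimal sketch of the documented workflow, driving the published entry points (run_inference.py, run_demo.py) from Python; only those script names and the --exp_config flag come from the README, and the config path below is a placeholder rather than a file guaranteed to ship with the repository.

```python
# Drives the documented entry points; run_inference.py, run_demo.py, and --exp_config
# come from the README. The config path and repo directory below are placeholders.
import subprocess
from pathlib import Path

REPO_DIR = Path("ml-slowfast-llava")  # clone root, with the LLaVA-NeXT weights placed inside

def run_benchmark(exp_config: str) -> None:
    """Evaluate on a VideoQA benchmark via the documented inference script."""
    subprocess.run(
        ["python", "run_inference.py", "--exp_config", exp_config],
        cwd=REPO_DIR,
        check=True,
    )

def run_single_video_demo() -> None:
    """Single-video demonstration described in the README."""
    subprocess.run(["python", "run_demo.py"], cwd=REPO_DIR, check=True)

if __name__ == "__main__":
    run_benchmark("configs/example_config.yaml")  # placeholder path
```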

Highlighted Details

  • Achieves comparable or better performance than state-of-the-art Video LLMs on various VideoQA tasks without fine-tuning.
  • Employs a training-free multimodal LLM approach for video understanding and reasoning.
  • Utilizes a dual-pathway (SlowFast) visual feature extraction.

Maintenance & Community

The project is associated with Apple and the research paper "SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models" by Xu et al. (2024). Further community or maintenance details are not specified in the README.

Licensing & Compatibility

  • License: Apple Sample Code License.
  • Compatibility: Specific compatibility notes for commercial use or closed-source linking are not detailed.

Limitations & Caveats

The README does not document known limitations or an alpha status. Practical caveats: the 34B model requires GPUs with at least 80 GB of memory, and GPT-3.5-turbo-based evaluation depends on an OpenAI API key.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 30 stars in the last 30 days
