Video understanding and reasoning with a training-free LLM
SlowFast-LLaVA addresses video understanding and reasoning with a training-free approach built on large language models (LLMs). It is designed for researchers and practitioners who need a strong baseline for video question-answering (VideoQA) without any fine-tuning, offering performance comparable or superior to existing state-of-the-art methods.
How It Works
SlowFast-LLaVA leverages a dual-pathway architecture inspired by the SlowFast network, processing video through both slow and fast pathways. This allows for capturing both fine-grained temporal details and broader motion patterns. Crucially, it integrates these visual features with LLMs in a training-free manner, enabling direct inference and evaluation on various VideoQA benchmarks. This approach bypasses the computationally intensive fine-tuning process, making it a highly efficient baseline.
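To make the pathway design concrete, here is a minimal sketch of a SlowFast-style two-pathway token aggregation in PyTorch. It is an illustration under stated assumptions: the function name slowfast_tokens, the stride and pooling sizes, and the (T, H, W, C) feature layout are hypothetical, not the repository's exact implementation.

```python
import torch
import torch.nn.functional as F

def slowfast_tokens(frame_feats: torch.Tensor,
                    slow_stride: int = 4,
                    fast_pool: int = 4) -> torch.Tensor:
    """Aggregate per-frame features (T, H, W, C) from a frozen vision
    encoder into one token sequence for the LLM. Illustrative only."""
    T, H, W, C = frame_feats.shape

    # Slow pathway: a sparse subset of frames at full spatial
    # resolution, preserving fine-grained spatial detail.
    slow = frame_feats[::slow_stride].reshape(-1, C)

    # Fast pathway: every frame, aggressively pooled in space, so
    # dense temporal coverage (motion) stays cheap in token count.
    fast = frame_feats.permute(0, 3, 1, 2)          # (T, C, H, W)
    fast = F.adaptive_avg_pool2d(fast, fast_pool)   # (T, C, p, p)
    fast = fast.permute(0, 2, 3, 1).reshape(-1, C)

    # Concatenated tokens go straight to the LLM; no parameters are
    # learned here, which is what makes the pipeline training-free.
    return torch.cat([slow, fast], dim=0)
```

Because this aggregation adds no trainable weights, a frozen image-trained model can be evaluated on video benchmarks directly.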
Quick Start & Requirements
Create a conda environment (conda create -n sf_llava python=3.10.12) and install dependencies with bash setup_env.sh from the ml-slowfast-llava directory. Run inference with python run_inference.py --exp_config $PATH_TO_CONFIG_FILE.
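For evaluating several benchmarks in sequence, one option is a small wrapper that loops over config files. This is a hypothetical sketch: the configs/ directory and .yaml extension are assumptions, while run_inference.py and its --exp_config flag come from the quick-start above.

```python
import subprocess
from pathlib import Path

CONFIG_DIR = Path("configs")  # assumed location of experiment configs

for cfg in sorted(CONFIG_DIR.glob("*.yaml")):
    # Run one benchmark configuration end to end per invocation.
    subprocess.run(
        ["python", "run_inference.py", "--exp_config", str(cfg)],
        check=True,  # stop early if any run fails
    )
```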
The 34B model requires GPUs with at least 80GB of memory. run_demo.py is available for single-video demonstrations.
Highlighted Details
Maintenance & Community
The project is associated with Apple and the research paper "SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models" by Xu et al. (2024). Further community or maintenance details are not specified in the README.
Licensing & Compatibility
Limitations & Caveats
The README does not list known limitations, alpha status, or other caveats; the main practical constraint is the 34B model's GPU memory requirement of at least 80GB.
Last updated: 1 year ago. Status: inactive.