Research paper for long-term video understanding using a memory-augmented large multimodal model
MA-LMM addresses long-term video understanding by introducing a memory-augmented large multimodal model. It targets researchers and practitioners in video analysis, offering a plug-and-play module that enhances zero-shot video understanding tasks such as question answering and summarization.
How It Works
MA-LMM integrates a memory bank into a large multimodal model (LMM), specifically building upon InstructBLIP. Frames are processed sequentially and their features are stored in the memory bank, so the model can retrieve relevant information from long video sequences and maintain context over extended durations. The core innovation is a memory bank compression scheme that caps the bank at a fixed length by averaging the most similar adjacent frame features, allowing effective long-term temporal reasoning without requiring extensive fine-tuning.
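A minimal sketch of this compression idea is shown below, simplified to one feature vector per frame; the function name, tensor shapes, and maximum length are illustrative and not the repository's API:

```python
import torch
import torch.nn.functional as F

def compress_memory_bank(memory: torch.Tensor, max_len: int) -> torch.Tensor:
    """Illustrative memory-bank compression (a sketch, not the repo's code).

    memory: [T, D] tensor of per-frame features appended over time.
    Whenever the bank grows past `max_len`, the two most similar
    temporally adjacent entries are averaged into one.
    """
    while memory.size(0) > max_len:
        # Cosine similarity between each pair of adjacent features: [T-1]
        sim = F.cosine_similarity(memory[:-1], memory[1:], dim=-1)
        i = int(sim.argmax())
        # Merge the most redundant adjacent pair by averaging.
        merged = (memory[i] + memory[i + 1]) / 2
        memory = torch.cat([memory[:i], merged.unsqueeze(0), memory[i + 2:]], dim=0)
    return memory

# Example: a 20-frame bank of 256-d features kept at 10 slots.
bank = compress_memory_bank(torch.randn(20, 256), max_len=10)
print(bank.shape)  # torch.Size([10, 256])
```

Merging only adjacent entries preserves the temporal order of the bank, which is what keeps long-range reasoning possible at a bounded memory cost.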
Quick Start & Requirements
Install with `pip install -e .` (after `git clone` and `cd MA-LMM`). Dependencies include `contexttimer`, `eva-decord` (or `decord` for non-Apple Silicon), `einops>=0.4.1`, and `fairscale==0.4.4`. Requires Vicuna-v1.1 LLM weights. Datasets (LVU, Breakfast, COIN, MSRVTT, MSVD, ActivityNet, Youcook2) need to be downloaded and frames extracted (e.g., using `extract_frames.py`).
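The repository's `extract_frames.py` handles frame extraction; as a rough illustration of what such preprocessing involves, here is a hedged sketch using `decord` (the function name, sampling strategy, and output layout are assumptions, not the script's actual behavior):

```python
import os
from decord import VideoReader, cpu
from PIL import Image

def extract_frames(video_path: str, out_dir: str, num_frames: int = 100) -> None:
    """Illustrative frame extraction (a sketch; see the repo's extract_frames.py)."""
    os.makedirs(out_dir, exist_ok=True)
    vr = VideoReader(video_path, ctx=cpu(0))
    # Uniformly sample up to `num_frames` indices across the whole video.
    step = max(len(vr) // num_frames, 1)
    indices = list(range(0, len(vr), step))[:num_frames]
    for i, idx in enumerate(indices):
        frame = vr[idx].asnumpy()  # H x W x 3 uint8, RGB
        Image.fromarray(frame).save(os.path.join(out_dir, f"frame_{i:05d}.jpg"))

# Example usage (hypothetical paths):
# extract_frames("videos/example.mp4", "frames/example", num_frames=100)
```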
Highlighted Details
Maintenance & Community
Last updated: 1 year ago. Status: inactive.
Licensing & Compatibility
Limitations & Caveats