boheumd/MA-LMM: Research paper for long-term video understanding using a memory-augmented multimodal model
Top 80.9% on SourcePulse
MA-LMM addresses long-term video understanding by introducing a memory-augmented large multimodal model. It targets researchers and practitioners in video analysis, offering a plug-and-play memory module that enhances zero-shot video understanding tasks such as question answering and summarization.
How It Works
MA-LMM integrates a memory bank into a large multimodal model (LMM), specifically building upon InstructBLIP. This memory bank stores and retrieves relevant information from long video sequences, enabling the model to maintain context over extended durations. The core innovation lies in its memory compression algorithm, which efficiently manages the memory bank's size and content, allowing for effective long-term temporal reasoning without requiring extensive fine-tuning.
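The sketch below illustrates the memory-bank idea under stated assumptions rather than reproducing the authors' implementation: frame features are appended to a fixed-capacity bank, and once the bank is full the two most similar adjacent entries are averaged into a single slot, so memory stays constant for arbitrarily long videos. The class name, merging rule details, and tensor shapes (32 tokens of dimension 768 per frame) are illustrative.

```python
import torch
import torch.nn.functional as F

class MemoryBank:
    """Illustrative fixed-size memory bank with similarity-based compression.

    Frame features are appended one at a time; once the bank exceeds
    `capacity`, the two most similar adjacent entries are merged (averaged),
    so the bank length never grows beyond `capacity`.
    """

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.features: list[torch.Tensor] = []  # each entry: (num_tokens, dim)

    def add(self, frame_feat: torch.Tensor) -> None:
        self.features.append(frame_feat)
        if len(self.features) > self.capacity:
            self._compress()

    def _compress(self) -> None:
        # Cosine similarity between every pair of adjacent entries.
        bank = torch.stack(self.features)            # (T, num_tokens, dim)
        flat = F.normalize(bank.flatten(1), dim=-1)  # (T, num_tokens * dim)
        sims = (flat[:-1] * flat[1:]).sum(dim=-1)    # (T - 1,)
        i = int(sims.argmax())                       # most redundant adjacent pair
        merged = (self.features[i] + self.features[i + 1]) / 2
        self.features[i:i + 2] = [merged]            # replace the pair with its average

    def read(self) -> torch.Tensor:
        # Concatenate all stored tokens for downstream cross-attention.
        return torch.cat(self.features, dim=0)


# Example: stream 100 frames' worth of features through a bank of size 10.
bank = MemoryBank(capacity=10)
for _ in range(100):
    bank.add(torch.randn(32, 768))  # e.g. 32 visual tokens of dim 768 per frame
print(bank.read().shape)            # torch.Size([320, 768]) -> 10 slots * 32 tokens
```

The key design point this sketch conveys is that compression happens online, so the cost of reading the bank is bounded by its capacity rather than by the video length.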
Quick Start & Requirements
Install: git clone the repository, cd MA-LMM, then pip install -e .
Dependencies: contexttimer, eva-decord (or decord on non-Apple-Silicon machines), einops>=0.4.1, fairscale==0.4.4
Weights: Vicuna-v1.1 LLM weights are required.
Datasets: LVU, Breakfast, COIN, MSRVTT, MSVD, ActivityNet, and YouCook2 must be downloaded and their frames extracted (e.g., using extract_frames.py), as sketched below.
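For the frame-extraction step, the snippet below is a minimal sketch that uniformly samples frames with decord and writes them out as JPEGs; the repository's own extract_frames.py may differ, and the paths, filename pattern, and num_frames value are placeholders.

```python
import os
import numpy as np
from decord import VideoReader, cpu
from PIL import Image

def extract_frames(video_path: str, out_dir: str, num_frames: int = 100) -> None:
    """Uniformly sample `num_frames` frames from `video_path` into `out_dir`."""
    os.makedirs(out_dir, exist_ok=True)
    vr = VideoReader(video_path, ctx=cpu(0))
    indices = np.linspace(0, len(vr) - 1, num_frames).astype(int)
    for rank, idx in enumerate(indices):
        frame = vr[int(idx)].asnumpy()  # (H, W, 3) uint8, RGB
        Image.fromarray(frame).save(os.path.join(out_dir, f"frame_{rank:04d}.jpg"))

# Example with placeholder paths:
# extract_frames("videos/example.mp4", "frames/example", num_frames=100)
```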
Highlighted Details
Maintenance & Community
Last activity: 1 year ago; the project is marked inactive.
Licensing & Compatibility
Limitations & Caveats
Related projects: NExT-GPT, LargeWorldModel