MA-LMM by boheumd

Research paper for long-term video understanding using memory-augmented multimodal model

created 1 year ago
318 stars

Top 86.3% on sourcepulse

View on GitHub
Project Summary

MA-LMM addresses long-term video understanding by introducing a memory-augmented large multimodal model. It targets researchers and practitioners in video analysis, offering a plug-and-play memory module that improves zero-shot performance on video understanding tasks such as question answering and summarization.

How It Works

MA-LMM integrates a memory bank into a large multimodal model (LMM), specifically building upon InstructBLIP. This memory bank stores and retrieves relevant information from long video sequences, enabling the model to maintain context over extended durations. The core innovation lies in its memory compression algorithm, which efficiently manages the memory bank's size and content, allowing for effective long-term temporal reasoning without requiring extensive fine-tuning.
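
The compression step can be pictured with the minimal sketch below. It assumes the merge rule described in the paper, averaging the two most similar temporally adjacent frame features once the bank exceeds a fixed capacity; the class and method names are illustrative and are not MA-LMM's actual API.

```python
# Minimal sketch of the memory bank compression idea; names (MemoryBank,
# add_frame, capacity) are illustrative, not MA-LMM's code. The merge rule
# (average the most similar adjacent pair) follows the paper's high-level
# description and may differ from the released implementation.
import torch
import torch.nn.functional as F


class MemoryBank:
    def __init__(self, capacity: int):
        self.capacity = capacity   # maximum number of stored frame features
        self.features = []         # list of [num_tokens, dim] tensors, one per frame

    def add_frame(self, feat: torch.Tensor) -> None:
        """Append one frame's visual features, compressing if over capacity."""
        self.features.append(feat)
        if len(self.features) > self.capacity:
            self._compress()

    def _compress(self) -> None:
        # Cosine similarity between temporally adjacent frame features.
        flat = torch.stack([f.flatten() for f in self.features])  # [T, n*d]
        sims = F.cosine_similarity(flat[:-1], flat[1:], dim=-1)   # [T-1]
        i = int(sims.argmax())  # most redundant adjacent pair
        # Merge that pair by averaging, shrinking the bank length by one.
        merged = (self.features[i] + self.features[i + 1]) / 2
        self.features[i : i + 2] = [merged]
```

Merging adjacent, highly similar features keeps the bank at a constant length, so attention cost over the memory stays bounded regardless of video duration.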

Quick Start & Requirements

  • Install: pip install -e . (after git clone and cd MA-LMM)
  • Prerequisites: Python, contexttimer, eva-decord (or decord for non-Apple Silicon), einops>=0.4.1, fairscale==0.4.4. Requires Vicuna-v1.1 LLM weights. Datasets (LVU, Breakfast, COIN, MSRVTT, MSVD, ActivityNet, Youcook2) need to be downloaded and their frames extracted (e.g., using extract_frames.py; an illustrative frame-extraction sketch follows this list).
  • Resources: Fine-tuning requires four A100 GPUs.
  • Links: Project Page, Demo
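
For orientation, the sketch below shows what uniform frame extraction with decord (a listed dependency) might look like. It is not the repository's extract_frames.py; the output layout (one JPEG per sampled frame) and the Pillow dependency are assumptions for illustration only.

```python
# Illustrative frame extraction with decord; not the repo's extract_frames.py.
import os

import numpy as np
from decord import VideoReader, cpu
from PIL import Image  # assumed for saving frames; not a listed dependency


def extract_frames(video_path: str, out_dir: str, num_frames: int = 100) -> None:
    """Sample num_frames uniformly from a video and save them as JPEGs."""
    os.makedirs(out_dir, exist_ok=True)
    vr = VideoReader(video_path, ctx=cpu(0))
    # Pick frame indices spread evenly across the whole video.
    indices = np.linspace(0, len(vr) - 1, num_frames).astype(int)
    frames = vr.get_batch(indices).asnumpy()  # [num_frames, H, W, 3], uint8
    for i, frame in enumerate(frames):
        Image.fromarray(frame).save(os.path.join(out_dir, f"{i:05d}.jpg"))
```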

Highlighted Details

  • Plug-and-play module for InstructBLIP, enabling zero-shot evaluation without fine-tuning.
  • Supports long-term video understanding tasks including video question answering and summarization.
  • Memory bank compression algorithm for efficient long-term context management.
  • Experiments conducted on LVU, Breakfast, COIN, MSRVTT, MSVD, ActivityNet, and Youcook2 datasets.

Maintenance & Community

  • Code references LAVIS.
  • No explicit community links (Discord/Slack) or roadmap provided in the README.

Licensing & Compatibility

  • The README does not explicitly state a license. The code is provided for research purposes.

Limitations & Caveats

  • The project is associated with a 2024 CVPR paper, suggesting it is research-oriented and may not be production-ready. Frame extraction may require adjustments due to FFmpeg version inconsistencies.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 19 stars in the last 90 days

Explore Similar Projects

Starred by Matei Zaharia (Cofounder of Databricks), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 3 more.

LWM by LargeWorldModel

  • Multimodal autoregressive model for long-context video/text
  • 7k stars; created 1 year ago, updated 9 months ago