m3-agent  by ByteDance-Seed

Multimodal agent with long-term memory

Created 1 month ago
928 stars

Top 39.4% on SourcePulse

Project Summary

M3-Agent is a multimodal agent framework designed for complex, long-term reasoning tasks, targeting researchers and developers building advanced AI agents. It enables agents to process continuous visual and auditory inputs, build and update an entity-centric, multimodal long-term memory (both episodic and semantic), and use this memory for iterative, instruction-driven task completion.

How It Works

M3-Agent employs a dual-process architecture: memorization and control. The memorization process analyzes video and audio streams to construct a multimodal graph representing episodic and semantic memories. The control process iteratively reasons, retrieves information from this memory, and executes instructions. This entity-centric, multimodal memory structure allows for deeper, more consistent environmental understanding and effective memory-based reasoning.
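The two processes can be pictured with a toy sketch. This is not the project's code: MemoryNode, MemoryGraph, memorize, and control are hypothetical names, observations are reduced to plain text, and keyword matching stands in for whatever retrieval the real memory graph uses. In the real system an LLM would generate each round's query; here the queries are passed in as a list.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryNode:
    """One memory item attached to an entity (hypothetical structure)."""
    entity: str   # e.g. a person or object identifier
    kind: str     # "episodic" (what happened) or "semantic" (general knowledge)
    content: str


@dataclass
class MemoryGraph:
    """Toy entity-centric memory: nodes grouped by the entity they describe."""
    nodes: dict[str, list[MemoryNode]] = field(default_factory=dict)

    def add(self, node: MemoryNode) -> None:
        self.nodes.setdefault(node.entity, []).append(node)

    def retrieve(self, query: str) -> list[MemoryNode]:
        # Stand-in for real retrieval: keyword match on entity name or content.
        q = query.lower()
        return [
            node
            for entity_nodes in self.nodes.values()
            for node in entity_nodes
            if q in node.entity.lower() or q in node.content.lower()
        ]


def memorize(graph: MemoryGraph, observations: list[tuple[str, str, str]]) -> None:
    """Memorization process: turn a stream of observations into memory nodes."""
    for entity, kind, content in observations:
        graph.add(MemoryNode(entity, kind, content))


def control(graph: MemoryGraph, queries: list[str], max_rounds: int = 3) -> list[str]:
    """Control process: query memory round by round, stopping at the first hit."""
    for query in queries[:max_rounds]:
        hits = graph.retrieve(query)
        if hits:
            return [node.content for node in hits]
    return []


graph = MemoryGraph()
memorize(graph, [
    ("Alice", "episodic", "Alice put the keys in the drawer"),
    ("Alice", "semantic", "Alice prefers tea"),
])
answer = control(graph, ["wallet", "keys"])  # first query misses, second hits
```

Grouping nodes by entity is what makes the memory entity-centric in this sketch: everything known about "Alice" sits under one key, so episodic and semantic facts about the same entity can be retrieved together.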

Quick Start & Requirements

  • Installation: Clone the repository, run bash setup.sh, then install the Python dependencies with pip.
  • Prerequisites: Python 3.x, ffmpeg, Hugging Face Transformers, and vLLM. Models and intermediate outputs can be downloaded from Hugging Face.
  • Setup: Download the datasets (M3-Bench-robot, M3-Bench-web) and, if needed, the intermediate outputs and memory graphs. API keys must be configured in configs/api_config.json.
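Before launching a run, it can help to confirm that the API config has actually been filled in. The sketch below assumes only that configs/api_config.json is valid JSON; the key names inside it are not documented here, so the check walks whatever structure it finds. load_api_config and find_empty are hypothetical helpers, not part of the repository.

```python
import json
from pathlib import Path


def find_empty(value, prefix=""):
    """Recursively collect dotted paths whose value is null or an empty string."""
    if isinstance(value, dict):
        paths = []
        for key, child in value.items():
            child_prefix = f"{prefix}.{key}" if prefix else str(key)
            paths.extend(find_empty(child, child_prefix))
        return paths
    if value is None or value == "":
        return [prefix]
    return []


def load_api_config(path="configs/api_config.json"):
    """Load the config and return it with the list of unfilled entries."""
    config = json.loads(Path(path).read_text(encoding="utf-8"))
    return config, find_empty(config)
```

A caller could run config, missing = load_api_config() and stop early with a clear message if missing is non-empty, rather than failing later on the first API call.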

Highlighted Details

  • Achieves state-of-the-art performance on the M3-Bench benchmark, outperforming strong baselines like Gemini-1.5-pro and GPT-4o by significant margins on long-video question answering.
  • Introduces M3-Bench, a novel benchmark for evaluating multimodal agents' long-term memory and reasoning capabilities, featuring real-world robot-perspective videos and diverse web-sourced videos.
  • Memory is structured as an entity-centric, multimodal graph, facilitating richer understanding and retrieval.
  • Supports prompting different large language models (LLMs) for memorization and control tasks.

Maintenance & Community

The project is published under the ByteDance-Seed account and is associated with ByteDance. The README links to separate training repositories for memorization and control.

Licensing & Compatibility

The repository does not explicitly state a license in the provided README. Users should verify licensing for commercial or closed-source use.

Limitations & Caveats

The README focuses on the agent's capabilities and benchmark performance, with limited details on specific limitations, unsupported platforms, or known issues. The project appears to be research-oriented, and production-readiness is not explicitly stated.

Health Check

  • Last commit: 2 weeks ago
  • Responsiveness: Inactive
  • Pull requests (30d): 1
  • Issues (30d): 4
  • Star history: 521 stars in the last 30 days
