m3-agent by ByteDance-Seed

Multimodal agent with long-term memory

Created 5 months ago

1,195 stars

Top 32.6% on SourcePulse

Project Summary

M3-Agent is a multimodal agent framework designed for complex, long-term reasoning tasks, targeting researchers and developers building advanced AI agents. It enables agents to process continuous visual and auditory inputs, build and update an entity-centric, multimodal long-term memory (both episodic and semantic), and use this memory for iterative, instruction-driven task completion.

How It Works

M3-Agent employs a dual-process architecture: memorization and control. The memorization process analyzes video and audio streams to construct a multimodal graph representing episodic and semantic memories. The control process iteratively reasons, retrieves information from this memory, and executes instructions. This entity-centric, multimodal memory structure allows for deeper, more consistent environmental understanding and effective memory-based reasoning.
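The split between the two processes can be illustrated with a toy sketch. Everything named here is hypothetical: MemoryStore, memorize, control, and the keyword-overlap retrieval are stand-ins chosen for illustration, not the project's API. In the actual system both processes are driven by models operating on video and audio, not string matching on text.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryStore:
    """Toy long-term memory: a flat list of text entries (hypothetical)."""
    entries: list[str] = field(default_factory=list)

    def search(self, query: str, exclude: set[str]) -> list[str]:
        # Keyword overlap stands in for real multimodal retrieval.
        words = set(query.lower().split())
        return [e for e in self.entries
                if e not in exclude and words & set(e.lower().split())]


def memorize(store: MemoryStore, clip_descriptions: list[str]) -> None:
    """Memorization process: turn each incoming clip into a memory entry."""
    for description in clip_descriptions:
        store.entries.append(description)


def control(store: MemoryStore, question: str, max_rounds: int = 3) -> list[str]:
    """Control process: retrieve iteratively, refining the query each round."""
    evidence: list[str] = []
    query = question
    for _ in range(max_rounds):
        hits = store.search(query, exclude=set(evidence))
        if not hits:
            break  # nothing new was retrieved, so stop
        evidence.extend(hits)
        # The next round searches with the words of what was just retrieved.
        query = " ".join(hits)
    return evidence


store = MemoryStore()
memorize(store, [
    "alice puts mug on shelf",
    "shelf stands beside window",
    "bob waters plant",
])
evidence = control(store, "where is mug")
print(evidence)
```

The second entry is only reachable through the first one, which is the point of iterating: a single retrieval pass on the question would miss it.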

Quick Start & Requirements

Installation: Clone the repository, run bash setup.sh, then install the remaining dependencies with pip.
Prerequisites: Python 3.x, ffmpeg, Hugging Face Transformers, and vLLM. Models and intermediate outputs can be downloaded from Hugging Face.
Setup: Download the datasets (M3-Bench-robot, M3-Bench-web) and, if needed, the intermediate outputs and memory graphs. API keys must be configured in configs/api_config.json.
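Since missing API keys are an easy setup mistake, a small pre-flight check can help. The sketch below is an assumption-laden illustration: the schema of configs/api_config.json is not described here, so the code makes no assumption about key names and simply reports any empty or null value it finds. The function names and the sample dictionary are hypothetical.

```python
import json
from pathlib import Path


def find_empty_values(node, path=""):
    """Recursively collect paths of empty-string or null values in parsed JSON."""
    if isinstance(node, dict):
        found = []
        for key, value in node.items():
            found += find_empty_values(value, f"{path}.{key}" if path else key)
        return found
    if isinstance(node, list):
        found = []
        for i, value in enumerate(node):
            found += find_empty_values(value, f"{path}[{i}]")
        return found
    return [path] if node in ("", None) else []


def check_config(config_path="configs/api_config.json"):
    """Return the unfilled entries, or None if the file does not exist."""
    p = Path(config_path)
    if not p.is_file():
        return None
    return find_empty_values(json.loads(p.read_text(encoding="utf-8")))


# Hypothetical sample; the real file's keys may look nothing like this.
sample = {"provider_a": {"api_key": ""}, "provider_b": {"api_key": "set"}}
print(find_empty_values(sample))
```

Running check_config() from the repository root before starting a long memorization job would surface unfilled entries early.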

Highlighted Details

Achieves state-of-the-art performance on the M3-Bench benchmark, outperforming strong baselines like Gemini-1.5-pro and GPT-4o by significant margins on long-video question answering.
Introduces M3-Bench, a novel benchmark for evaluating multimodal agents' long-term memory and reasoning capabilities, featuring real-world robot-perspective videos and diverse web-sourced videos.
Memory is structured as an entity-centric, multimodal graph, facilitating richer understanding and retrieval.
Supports prompting different large language models (LLMs) for memorization and control tasks.
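To make the idea of an entity-centric, multimodal memory graph more concrete, here is a minimal data-structure sketch. It is not the project's implementation: EntityGraph, MemoryNode, the field names, and the modality labels are all assumptions made for illustration. The only ideas taken from the description above are that memories are either episodic or semantic, may come from different modalities, and are organized around entities.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryNode:
    node_id: int
    kind: str       # "episodic" (a concrete event) or "semantic" (general knowledge)
    modality: str   # assumed labels, e.g. "visual", "audio", "text"
    content: str


class EntityGraph:
    """Toy entity-centric memory: nodes are indexed by the entities they mention."""

    def __init__(self):
        self.nodes = {}
        self.by_entity = defaultdict(list)

    def add(self, kind, modality, content, entities):
        node = MemoryNode(len(self.nodes), kind, modality, content)
        self.nodes[node.node_id] = node
        for entity in entities:
            self.by_entity[entity].append(node.node_id)
        return node

    def about(self, entity, kind=None):
        """All memories linked to an entity, optionally filtered by kind."""
        ids = self.by_entity.get(entity, [])
        return [self.nodes[i] for i in ids
                if kind is None or self.nodes[i].kind == kind]


graph = EntityGraph()
graph.add("episodic", "visual", "Alice poured coffee early in the clip", ["alice"])
graph.add("semantic", "text", "Alice prefers coffee over tea", ["alice"])
graph.add("episodic", "audio", "Bob said good morning", ["bob"])

semantic_about_alice = [n.content for n in graph.about("alice", kind="semantic")]
print(semantic_about_alice)
```

Indexing by entity means a question about one person or object pulls together everything known about it, regardless of which clip or modality the memory came from.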

Maintenance & Community

The project is associated with ByteDance. Links to training repositories for memorization and control are provided.

Licensing & Compatibility

The repository does not explicitly state a license in the provided README. Users should verify licensing for commercial or closed-source use.

Limitations & Caveats

The README focuses on the agent's capabilities and benchmark performance, with limited details on specific limitations, unsupported platforms, or known issues. The project appears to be research-oriented, and production-readiness is not explicitly stated.
