Multimodal agent with long-term memory
Top 39.4% on SourcePulse
M3-Agent is a multimodal agent framework designed for complex, long-term reasoning tasks, targeting researchers and developers building advanced AI agents. It enables agents to process continuous visual and auditory inputs, build and update an entity-centric, multimodal long-term memory (both episodic and semantic), and use this memory for iterative, instruction-driven task completion.
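The README does not show how the memory is laid out; purely as a mental model, the sketch below uses hypothetical MemoryEntry and MemoryGraph names to illustrate what an entity-centric store with separate episodic and semantic entries per entity could look like.

```python
# Illustrative sketch only (hypothetical names, not M3-Agent's actual API):
# an entity-centric memory store holding episodic and semantic entries.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class MemoryEntry:
    clip_id: str   # source video/audio segment the memory came from
    kind: str      # "episodic" (what happened) or "semantic" (stable fact)
    text: str      # natural-language content of the memory

@dataclass
class EntityMemory:
    entity_id: str                           # e.g. a recurring person or object
    entries: List[MemoryEntry] = field(default_factory=list)

class MemoryGraph:
    """Toy entity-centric store; the real system builds a multimodal graph."""
    def __init__(self) -> None:
        self.entities: Dict[str, EntityMemory] = {}

    def add(self, entity_id: str, entry: MemoryEntry) -> None:
        self.entities.setdefault(entity_id, EntityMemory(entity_id)).entries.append(entry)

    def query(self, entity_id: str, kind: str) -> List[str]:
        mem = self.entities.get(entity_id)
        return [e.text for e in mem.entries if e.kind == kind] if mem else []

graph = MemoryGraph()
graph.add("person_1", MemoryEntry("clip_0001", "episodic", "greeted the camera in the kitchen"))
graph.add("person_1", MemoryEntry("clip_0001", "semantic", "person_1 is named Alice"))
print(graph.query("person_1", "semantic"))
```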
How It Works
M3-Agent employs a dual-process architecture: memorization and control. The memorization process analyzes video and audio streams to construct a multimodal graph representing episodic and semantic memories. The control process iteratively reasons, retrieves information from this memory, and executes instructions. This entity-centric, multimodal memory structure allows for deeper, more consistent environmental understanding and effective memory-based reasoning.
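The control process itself is not shown in the README; the following toy loop (placeholder retrieve and reason functions, not M3-Agent's API) illustrates the iterative retrieve-reason-answer pattern the description refers to.

```python
# Illustrative control loop only: iteratively retrieve from long-term memory,
# reason over the results, and either answer or refine the query, up to a
# fixed number of rounds. All functions are stand-ins for model calls.
from typing import List, Optional

def retrieve(memory: List[str], query: str, k: int = 3) -> List[str]:
    # Naive keyword retrieval standing in for multimodal memory search.
    hits = [m for m in memory if any(w in m.lower() for w in query.lower().split())]
    return hits[:k]

def reason(instruction: str, evidence: List[str]) -> Optional[str]:
    # Placeholder for an LLM/VLM call; here we "answer" once any evidence exists.
    return f"{instruction} -> based on {len(evidence)} memory items" if evidence else None

def control_loop(instruction: str, memory: List[str], max_rounds: int = 5) -> str:
    query = instruction
    for _ in range(max_rounds):
        evidence = retrieve(memory, query)
        answer = reason(instruction, evidence)
        if answer is not None:
            return answer
        query = query + " details"  # stand-in for query reformulation
    return "no answer found within budget"

memory = ["Alice greeted the camera in the kitchen", "the kitchen light was left on"]
print(control_loop("Who was in the kitchen?", memory))
```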
Quick Start & Requirements
Run bash setup.sh and install dependencies via pip install; requirements include ffmpeg, Hugging Face Transformers, and vllm. Specific models and intermediate outputs can be downloaded from Hugging Face, and a configs/api_config.json file is necessary.
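The config schema is not documented here; the snippet below is a hypothetical pre-flight check, using only the Python standard library, that confirms ffmpeg is on PATH and that configs/api_config.json exists and parses, without assuming any particular keys.

```python
# Hypothetical pre-flight check (not part of the repository): verify ffmpeg
# is installed and the API config file is present and valid JSON before
# launching the agent.
import json
import shutil
import sys
from pathlib import Path

def preflight(config_path: str = "configs/api_config.json") -> None:
    if shutil.which("ffmpeg") is None:
        sys.exit("ffmpeg not found on PATH; install it before running the agent")
    path = Path(config_path)
    if not path.is_file():
        sys.exit(f"missing {config_path}; create it with your API credentials")
    config = json.loads(path.read_text())
    print(f"loaded {config_path} with keys: {sorted(config)}")

if __name__ == "__main__":
    preflight()
```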
Highlighted Details
Maintenance & Community
The project is associated with ByteDance. Links to training repositories for memorization and control are provided.
Licensing & Compatibility
The repository does not explicitly state a license in the provided README. Users should verify licensing for commercial or closed-source use.
Limitations & Caveats
The README focuses on the agent's capabilities and benchmark performance, with limited details on specific limitations, unsupported platforms, or known issues. The project appears to be research-oriented, and production-readiness is not explicitly stated.