MSA by EverMind-AI

LLM memory framework scales to 100M+ tokens

Created 5 months ago
2,906 stars

Top 16.1% on SourcePulse

Project Summary

Summary

MSA (Memory Sparse Attention) tackles the LLM context-window limitation, which restricts long-term memory and reasoning. Unlike existing solutions, which suffer from precision decay or complex pipelines, MSA is an end-to-end trainable, sparse latent-state memory framework. It processes up to 100 million tokens with minimal degradation, significantly expanding LLM memory capacity and reasoning for researchers and engineers.

How It Works

MSA achieves near-linear complexity via scalable sparse attention and document-wise RoPE. Key components: a Memory Sparse Attention layer that integrates top-k document selection with sparse attention for differentiability and inference decoupling; a Memory Parallel inference engine that uses tiered KV-cache compression (routing keys on GPU, content K/V on CPU) for efficient 100M-token throughput on specialized hardware; and a Memory Interleave mechanism that enables adaptive multi-round, multi-hop reasoning through a retrieval-expansion-generation loop, boosting performance on complex long-context tasks.
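The retrieve-then-attend idea behind the Memory Sparse Attention layer can be sketched as a toy in NumPy. This is illustrative only, not the project's actual API: `memory_sparse_attention`, the mean-key routing heuristic, and all shapes are hypothetical stand-ins for MSA's learned routing keys.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def memory_sparse_attention(query, doc_keys, doc_values, k=2):
    """Toy top-k document selection followed by attention over only the
    selected documents' tokens (all names here are hypothetical).

    query:      (d,)             current query vector
    doc_keys:   list of (n_i, d) per-document key matrices
    doc_values: list of (n_i, d) per-document value matrices
    """
    # Score each document with a cheap routing summary (mean of its keys),
    # then keep only the top-k documents.
    routing = np.stack([dk.mean(axis=0) for dk in doc_keys])  # (D, d)
    scores = routing @ query                                  # (D,)
    top = np.argsort(scores)[-k:]

    # Attend densely, but only over the selected documents' tokens:
    # cost scales with the k retrieved documents, not the full memory.
    K = np.concatenate([doc_keys[i] for i in top])
    V = np.concatenate([doc_values[i] for i in top])
    w = softmax(K @ query / np.sqrt(query.shape[0]))
    return w @ V

rng = np.random.default_rng(0)
d = 16
docs_k = [rng.normal(size=(8, d)) for _ in range(10)]
docs_v = [rng.normal(size=(8, d)) for _ in range(10)]
out = memory_sparse_attention(rng.normal(size=d), docs_k, docs_v, k=2)
print(out.shape)  # (16,)
```

Because document selection is folded into the attention computation, a scheme like this can remain end-to-end differentiable with respect to the attended tokens while the routing step keeps per-query cost sparse.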

Quick Start & Requirements

Code and pre-trained models are available. Achieving 100M-token inference requires substantial hardware: 2×A800 GPUs. Training involves extensive continuous pretraining on a 158.95-billion-token dataset. Further details and project updates are available on the official homepage: https://evermind.ai/.

Highlighted Details

  • Extreme Scalability: Less than 9% performance degradation across context lengths from 16K to 100M tokens.
  • High-Throughput Inference: Enables 100M token inference on 2×A800 GPUs via KV cache compression and Memory Parallel engine.
  • State-of-the-Art Performance: Outperforms RAG variants and leading long-context models on QA and NIAH benchmarks.
  • Enhanced Reasoning: Memory Interleave improves multi-hop reasoning across disparate memory segments.
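The tiered KV-cache compression behind the high-throughput inference claim can be illustrated with a toy two-tier store: compact routing keys stay in a fast tier (standing in for GPU memory) while full content K/V lives in a slow tier (standing in for CPU memory) and is fetched only for the chunks the router selects. The `TieredKVCache` class and its mean-key routing below are hypothetical sketches, not MSA's real engine.

```python
import numpy as np

class TieredKVCache:
    """Hypothetical two-tier KV cache: a small routing vector per chunk
    lives in the fast tier, while the full content K/V tensors live in
    the slow tier and are only materialized for selected chunks."""

    def __init__(self, k=2):
        self.k = k
        self.routing = []   # fast tier: one small summary vector per chunk
        self.cold = []      # slow tier: full (K, V) per chunk

    def append(self, K, V):
        self.routing.append(K.mean(axis=0))   # cheap routing summary
        self.cold.append((K, V))

    def retrieve(self, query):
        # Route using only the fast tier...
        scores = np.stack(self.routing) @ query
        top = np.argsort(scores)[-self.k:]
        # ...and touch the slow tier only for the selected chunks.
        K = np.concatenate([self.cold[i][0] for i in top])
        V = np.concatenate([self.cold[i][1] for i in top])
        return K, V

rng = np.random.default_rng(1)
cache = TieredKVCache(k=2)
for _ in range(8):
    cache.append(rng.normal(size=(4, 8)), rng.normal(size=(4, 8)))
K, V = cache.retrieve(rng.normal(size=8))
print(K.shape, V.shape)  # (8, 8) (8, 8)
```

The design point is that fast-tier memory grows with one small vector per chunk rather than with the full K/V, which is what makes very long contexts feasible on a fixed GPU budget.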

Maintenance & Community

Maintained by the authors, with updates posted on the official homepage: https://evermind.ai/. No specific community channels (e.g., Discord, Slack) are mentioned.

Licensing & Compatibility

The README does not specify a software license, creating uncertainty for commercial use or integration into closed-source projects.

Limitations & Caveats

The primary adoption blocker is the lack of a clear license. Maximum context lengths require high-end GPU hardware (e.g., A800s). Performance may still be constrained by the underlying backbone LLM's intrinsic reasoning capacity and parameter count.

Health Check

  • Last Commit: 6 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 7
  • Issues (30d): 6
  • Star History: 2,951 stars in the last 30 days

Explore Similar Projects

Starred by Mehdi Amini (author of MLIR; Distinguished Engineer at NVIDIA), Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), and 15 more.

flashinfer by flashinfer-ai

1.6% · 5k stars
Kernel library for LLM serving
Created 2 years ago · Updated 1 day ago