MSA by EverMind-AI

LLM memory framework scales to 100M+ tokens

Created 5 months ago
2,906 stars

Top 16.1% on SourcePulse

Project Summary

Summary

MSA (Memory Sparse Attention) tackles the LLM context-window limitation, which restricts long-term memory and reasoning. Unlike existing solutions, which suffer from precision decay or complex pipelines, MSA is an end-to-end trainable, sparse latent-state memory framework. It processes up to 100 million tokens with minimal degradation, significantly expanding LLM memory capacity and reasoning for researchers and engineers.

How It Works

MSA achieves near-linear complexity via scalable sparse attention and document-wise RoPE. Key components: a Memory Sparse Attention layer that integrates top-k document selection with sparse attention for differentiability and inference decoupling; a Memory Parallel inference engine that uses tiered KV-cache compression (routing keys on GPU, content K/V on CPU) for efficient 100M-token throughput on specialized hardware; and a Memory Interleave mechanism that enables adaptive multi-round, multi-hop reasoning through a retrieval-expansion-generation loop, boosting performance on complex long-context tasks.
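The retrieve-then-attend idea behind the Memory Sparse Attention layer can be sketched as a toy in NumPy. This is illustrative only, not the project's actual API: `memory_sparse_attention`, the mean-key routing heuristic, and all shapes are hypothetical stand-ins for MSA's learned routing keys.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def memory_sparse_attention(query, doc_keys, doc_values, k=2):
    """Toy top-k document selection followed by attention over only the
    selected documents' tokens (all names here are hypothetical).

    query:      (d,)             current query vector
    doc_keys:   list of (n_i, d) per-document key matrices
    doc_values: list of (n_i, d) per-document value matrices
    """
    # Score each document with a cheap routing summary (mean of its keys),
    # then keep only the top-k documents.
    routing = np.stack([dk.mean(axis=0) for dk in doc_keys])  # (D, d)
    scores = routing @ query                                  # (D,)
    top = np.argsort(scores)[-k:]

    # Attend densely, but only over the selected documents' tokens:
    # cost scales with the k retrieved documents, not the full memory.
    K = np.concatenate([doc_keys[i] for i in top])
    V = np.concatenate([doc_values[i] for i in top])
    w = softmax(K @ query / np.sqrt(query.shape[0]))
    return w @ V

rng = np.random.default_rng(0)
d = 16
docs_k = [rng.normal(size=(8, d)) for _ in range(10)]
docs_v = [rng.normal(size=(8, d)) for _ in range(10)]
out = memory_sparse_attention(rng.normal(size=d), docs_k, docs_v, k=2)
print(out.shape)  # (16,)
```

Because document selection is folded into the attention computation, a scheme like this can remain end-to-end differentiable with respect to the attended tokens while the routing step keeps per-query cost sparse.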

Quick Start & Requirements

Code and pre-trained models are available. Achieving 100M-token inference requires substantial hardware: 2×A800 GPUs. Training involves extensive continuous pretraining on a 158.95-billion-token dataset. Further details and project updates are available on the official homepage: https://evermind.ai/.

Highlighted Details

  • Extreme Scalability: Less than 9% performance degradation across context lengths from 16K to 100M tokens.
  • High-Throughput Inference: Enables 100M token inference on 2×A800 GPUs via KV cache compression and Memory Parallel engine.
  • State-of-the-Art Performance: Outperforms RAG variants and leading long-context models on QA and NIAH benchmarks.
  • Enhanced Reasoning: Memory Interleave improves multi-hop reasoning across disparate memory segments.
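The tiered KV-cache compression behind the high-throughput inference claim can be illustrated with a toy two-tier store: compact routing keys stay in a fast tier (standing in for GPU memory) while full content K/V lives in a slow tier (standing in for CPU memory) and is fetched only for the chunks the router selects. The `TieredKVCache` class and its mean-key routing below are hypothetical sketches, not MSA's real engine.

```python
import numpy as np

class TieredKVCache:
    """Hypothetical two-tier KV cache: a small routing vector per chunk
    lives in the fast tier, while the full content K/V tensors live in
    the slow tier and are only materialized for selected chunks."""

    def __init__(self, k=2):
        self.k = k
        self.routing = []   # fast tier: one small summary vector per chunk
        self.cold = []      # slow tier: full (K, V) per chunk

    def append(self, K, V):
        self.routing.append(K.mean(axis=0))   # cheap routing summary
        self.cold.append((K, V))

    def retrieve(self, query):
        # Route using only the fast tier...
        scores = np.stack(self.routing) @ query
        top = np.argsort(scores)[-self.k:]
        # ...and touch the slow tier only for the selected chunks.
        K = np.concatenate([self.cold[i][0] for i in top])
        V = np.concatenate([self.cold[i][1] for i in top])
        return K, V

rng = np.random.default_rng(1)
cache = TieredKVCache(k=2)
for _ in range(8):
    cache.append(rng.normal(size=(4, 8)), rng.normal(size=(4, 8)))
K, V = cache.retrieve(rng.normal(size=8))
print(K.shape, V.shape)  # (8, 8) (8, 8)
```

The design point is that fast-tier memory grows with one small vector per chunk rather than with the full K/V, which is what makes very long contexts feasible on a fixed GPU budget.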

Maintenance & Community

Maintained by the authors, with updates posted on the official homepage: https://evermind.ai/. No specific community channels (e.g., Discord, Slack) are mentioned.

Licensing & Compatibility

The README does not specify a software license, creating uncertainty for commercial use or integration into closed-source projects.

Limitations & Caveats

The primary adoption blocker is the lack of a clear license. Maximum context lengths require high-end GPU hardware (e.g., A800s). Performance may still be constrained by the underlying backbone LLM's intrinsic reasoning capacity and parameter count.

Health Check

  • Last Commit: 6 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 7
  • Issues (30d): 6
  • Star History: 2,951 stars in the last 30 days

Explore Similar Projects

Starred by Mehdi Amini (author of MLIR; Distinguished Engineer at NVIDIA), Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), and 15 more.

flashinfer by flashinfer-ai

1.6% · 5k stars
Kernel library for LLM serving
Created 2 years ago · Updated 1 day ago