memento by Microsoft

LLM reasoning extension framework

Created 1 month ago
413 stars

Top 70.7% on SourcePulse

Project Summary

Extends the effective output length of large language models by segmenting chain-of-thought reasoning into manageable blocks. This approach allows models to perform more complex reasoning tasks within fixed context window constraints, benefiting researchers and developers working with LLMs on extended generation or analysis.

How It Works

Memento implements a block-based reasoning strategy where chain-of-thought (CoT) is divided into discrete segments. After each reasoning block, a concise summary is generated, and the detailed block content is then evicted from the KV cache. The model continues its reasoning process from this summary, effectively reducing the context size and enabling deeper, multi-step computations within the original, fixed context window. This is facilitated by specialized tokens for block and summary boundaries and a modified inference engine.
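The loop described above can be sketched as follows. This is an illustrative simulation, not Memento's actual implementation: the function names (reason_block, summarize, solve) are hypothetical stand-ins, and evicting KV-cache entries is modeled simply as dropping the block text from the running context string.

```python
# Hypothetical sketch of Memento-style block-based reasoning: after each
# chain-of-thought block, only a concise summary stays in context; the
# detailed block is evicted (analogous to dropping its KV-cache entries).
# All names here are illustrative stand-ins, not the project's real API.

def reason_block(context: str, step: int) -> str:
    """Stand-in for one detailed chain-of-thought block from the model."""
    return f"[detailed reasoning for step {step} over {len(context)} chars of context]"

def summarize(block: str, step: int) -> str:
    """Stand-in for the model's concise summary of a finished block."""
    return f"summary({step})"

def solve(question: str, num_blocks: int, max_context: int) -> str:
    context = question
    for step in range(num_blocks):
        block = reason_block(context, step)       # full block exists only transiently
        summary = summarize(block, step)
        context = context + " " + summary         # eviction: keep summary, drop block
        assert len(context) <= max_context, "summaries must fit the fixed window"
    return context

print(solve("Q?", num_blocks=5, max_context=200))
```

Because only summaries accumulate, the context grows far more slowly than the total reasoning produced, which is what lets deeper multi-step computation fit inside a fixed window.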

Quick Start & Requirements

  • Data Pipeline: Install dependencies with pip install -r data/requirements.txt. Requires an OPENAI_API_KEY or compatible provider. Run with python run_full_pipeline.py --input ../examples/example_trace.jsonl --output-dir output/ --model gpt-4o --limit 1. See data/README.md for full documentation.
  • vLLM Inference: Install vllm==0.13.0 (pip install vllm==0.13.0), then apply the overlay: cd vllm && bash install_overlay.sh. Serve a Memento model using python -m vllm.entrypoints.openai.api_server --model /path/to/memento-checkpoint ... --block-masking-config '{...}'. Requires a Memento model checkpoint and GPU resources. See vllm/README.md for full documentation.

Highlighted Details

  • Novel KV cache eviction strategy using block summaries to extend effective context length.
  • Specialized tokens (<|block_start|>, <|summary_start|>, etc.) for structured reasoning and summarization.
  • Optimized inference via a custom vLLM overlay with block masking and KV cache compaction.
  • Data pipeline for preparing SFT training data from CoT traces.
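To make the token-structured traces concrete, here is a small sketch of wrapping blocks and summaries in boundary tokens and then recovering only the summaries (the part that survives eviction). Note the README confirms <|block_start|> and <|summary_start|> ("etc."); the matching end tokens used here are assumed for symmetry, and make_segment/extract_summaries are hypothetical helpers.

```python
import re

# The README names <|block_start|> and <|summary_start|>; the corresponding
# end tokens below are an assumption made for this illustration.
BLOCK_START, BLOCK_END = "<|block_start|>", "<|block_end|>"
SUM_START, SUM_END = "<|summary_start|>", "<|summary_end|>"

def make_segment(block_text: str, summary_text: str) -> str:
    """Wrap one reasoning block and its summary in boundary tokens."""
    return (f"{BLOCK_START}{block_text}{BLOCK_END}"
            f"{SUM_START}{summary_text}{SUM_END}")

def extract_summaries(trace: str) -> list[str]:
    """Pull out just the summaries, i.e. what would remain after eviction."""
    pattern = re.escape(SUM_START) + r"(.*?)" + re.escape(SUM_END)
    return re.findall(pattern, trace, re.S)

trace = (make_segment("step 1: factor the expression", "factored form found")
         + make_segment("step 2: substitute values", "x = 3"))
print(extract_summaries(trace))  # -> ['factored form found', 'x = 3']
```

Explicit boundary tokens like these are what let a modified inference engine know exactly which KV-cache spans are safe to compact away.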

Maintenance & Community

No specific details regarding contributors, sponsorships, community channels (e.g., Discord/Slack), or roadmaps are provided in the README.

Licensing & Compatibility

This project is licensed under the MIT License, which generally permits commercial use and integration into closed-source projects.

Limitations & Caveats

The inference setup requires building a custom vLLM version using an overlay script and specifically depends on vllm==0.13.0. The README does not detail performance benchmarks or specific model compatibility beyond the general approach.

Health Check

  • Last Commit: 2 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 417 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Sebastian Raschka (author of "Build a Large Language Model (From Scratch)"), and 11 more.

optillm by algorithmicsuperintelligence

Optimizing inference proxy for LLMs. Top 0.2% on SourcePulse, 3k stars. Created 1 year ago, updated 1 month ago.