memento by Microsoft

LLM reasoning extension framework

Created 1 month ago
413 stars

Top 70.7% on SourcePulse

Project Summary

Extends the effective output length of large language models by segmenting chain-of-thought reasoning into manageable blocks. This approach allows models to perform more complex reasoning tasks within fixed context window constraints, benefiting researchers and developers working with LLMs on extended generation or analysis.

How It Works

Memento implements a block-based reasoning strategy where chain-of-thought (CoT) is divided into discrete segments. After each reasoning block, a concise summary is generated, and the detailed block content is then evicted from the KV cache. The model continues its reasoning process from this summary, effectively reducing the context size and enabling deeper, multi-step computations within the original, fixed context window. This is facilitated by specialized tokens for block and summary boundaries and a modified inference engine.
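The loop described above can be sketched as follows. This is an illustrative simulation, not Memento's actual implementation: the function names (reason_block, summarize, solve) are hypothetical stand-ins, and evicting KV-cache entries is modeled simply as dropping the block text from the running context string.

```python
# Hypothetical sketch of Memento-style block-based reasoning: after each
# chain-of-thought block, only a concise summary stays in context; the
# detailed block is evicted (analogous to dropping its KV-cache entries).
# All names here are illustrative stand-ins, not the project's real API.

def reason_block(context: str, step: int) -> str:
    """Stand-in for one detailed chain-of-thought block from the model."""
    return f"[detailed reasoning for step {step} over {len(context)} chars of context]"

def summarize(block: str, step: int) -> str:
    """Stand-in for the model's concise summary of a finished block."""
    return f"summary({step})"

def solve(question: str, num_blocks: int, max_context: int) -> str:
    context = question
    for step in range(num_blocks):
        block = reason_block(context, step)       # full block exists only transiently
        summary = summarize(block, step)
        context = context + " " + summary         # eviction: keep summary, drop block
        assert len(context) <= max_context, "summaries must fit the fixed window"
    return context

print(solve("Q?", num_blocks=5, max_context=200))
```

Because only summaries accumulate, the context grows far more slowly than the total reasoning produced, which is what lets deeper multi-step computation fit inside a fixed window.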

Quick Start & Requirements

  • Data Pipeline: Install dependencies with pip install -r data/requirements.txt. Requires an OPENAI_API_KEY or compatible provider. Run with python run_full_pipeline.py --input ../examples/example_trace.jsonl --output-dir output/ --model gpt-4o --limit 1. See data/README.md for full documentation.
  • vLLM Inference: Install vllm==0.13.0 (pip install vllm==0.13.0), then apply the overlay: cd vllm && bash install_overlay.sh. Serve a Memento model using python -m vllm.entrypoints.openai.api_server --model /path/to/memento-checkpoint ... --block-masking-config '{...}'. Requires a Memento model checkpoint and GPU resources. See vllm/README.md for full documentation.

Highlighted Details

  • Novel KV cache eviction strategy using block summaries to extend effective context length.
  • Specialized tokens (<|block_start|>, <|summary_start|>, etc.) for structured reasoning and summarization.
  • Optimized inference via a custom vLLM overlay with block masking and KV cache compaction.
  • Data pipeline for preparing SFT training data from CoT traces.
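To make the token-structured traces concrete, here is a small sketch of wrapping blocks and summaries in boundary tokens and then recovering only the summaries (the part that survives eviction). Note the README confirms <|block_start|> and <|summary_start|> ("etc."); the matching end tokens used here are assumed for symmetry, and make_segment/extract_summaries are hypothetical helpers.

```python
import re

# The README names <|block_start|> and <|summary_start|>; the corresponding
# end tokens below are an assumption made for this illustration.
BLOCK_START, BLOCK_END = "<|block_start|>", "<|block_end|>"
SUM_START, SUM_END = "<|summary_start|>", "<|summary_end|>"

def make_segment(block_text: str, summary_text: str) -> str:
    """Wrap one reasoning block and its summary in boundary tokens."""
    return (f"{BLOCK_START}{block_text}{BLOCK_END}"
            f"{SUM_START}{summary_text}{SUM_END}")

def extract_summaries(trace: str) -> list[str]:
    """Pull out just the summaries, i.e. what would remain after eviction."""
    pattern = re.escape(SUM_START) + r"(.*?)" + re.escape(SUM_END)
    return re.findall(pattern, trace, re.S)

trace = (make_segment("step 1: factor the expression", "factored form found")
         + make_segment("step 2: substitute values", "x = 3"))
print(extract_summaries(trace))  # -> ['factored form found', 'x = 3']
```

Explicit boundary tokens like these are what let a modified inference engine know exactly which KV-cache spans are safe to compact away.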

Maintenance & Community

No specific details regarding contributors, sponsorships, community channels (e.g., Discord/Slack), or roadmaps are provided in the README.

Licensing & Compatibility

This project is licensed under the MIT License, which generally permits commercial use and integration into closed-source projects.

Limitations & Caveats

The inference setup requires building a custom vLLM version using an overlay script and specifically depends on vllm==0.13.0. The README does not detail performance benchmarks or specific model compatibility beyond the general approach.

Health Check

  • Last Commit: 2 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 417 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Sebastian Raschka (author of "Build a Large Language Model (From Scratch)"), and 11 more.

optillm by algorithmicsuperintelligence

Optimizing inference proxy for LLMs. Top 0.2% on SourcePulse, 3k stars. Created 1 year ago, updated 1 month ago.