cartridges by HazyResearch

Lightweight long context representation for LLMs

Created 10 months ago
258 stars

Top 98.0% on SourcePulse

View on GitHub
Project Summary

Summary

Cartridges addresses the high cost of processing long contexts in Large Language Models (LLMs) by introducing a novel method for creating compact Key-Value (KV) caches. Targeting researchers and engineers working with LLMs, it enables significant throughput gains (up to 26x) while preserving generation quality, making long-context applications more efficient and cost-effective.

How It Works

The core innovation is "self-study," a test-time training recipe that distills a large corpus into a small, efficient KV cache, termed a "cartridge." This process involves generating synthetic conversational data about the corpus using AI agents (one asking questions, another answering) and then training the cartridge via context distillation. This approach drastically reduces KV cache size, directly translating to higher throughput during inference.
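The self-study loop described above can be sketched as follows. This is a minimal illustration, not the repository's implementation: the ask and answer functions are hypothetical stand-ins for the two LLM agents, and the context-distillation training step is omitted.

```python
# Sketch of the "self-study" synthetic-data loop. In the real recipe,
# ask() and answer() are LLM agents; here they are string stubs.

def ask(chunk, turn):
    # Hypothetical question agent: generates a question grounded in the chunk.
    return f"Q{turn}: What does this passage say about '{chunk[:20]}'?"

def answer(chunk, question):
    # Hypothetical answer agent: answers using the full chunk as context.
    return f"Based on the passage: {chunk}"

def synthesize(corpus, n_turns=3):
    """Generate synthetic conversational data about a corpus.

    The resulting conversations would then be used to train a compact
    KV cache (a "cartridge") via context distillation.
    """
    conversations = []
    for i in range(n_turns):
        chunk = corpus[i % len(corpus)]
        q = ask(chunk, i)
        a = answer(chunk, q)
        conversations.append({"question": q, "answer": a})
    return conversations

corpus = [
    "The cartridge compresses the KV cache for a long corpus.",
    "Self-study distills the corpus at test time.",
]
data = synthesize(corpus)
print(len(data))  # 3 synthetic QA turns
```

Each synthetic turn pairs a question about the corpus with a grounded answer; training on many such turns is what lets the small cartridge stand in for the full long context.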

Quick Start & Requirements

  • Primary install: clone the repository, install uv, then run "uv pip install -e ." (the trailing dot is the install target, i.e. the repository root).
  • Environment variables: Requires setting CARTRIDGES_DIR, CARTRIDGES_OUTPUT_DIR, CARTRIDGES_WANDB_PROJECT, and CARTRIDGES_WANDB_ENTITY.
  • Prerequisites: Python, uv, wandb, an inference server (Tokasaurus or SGLang), and GPU access. Modal is recommended for scalable inference workloads.
  • Links: Paper (arXiv:2506.06266), Synthesis example (examples/arxiv/arxiv_synthesize.py), Training example (examples/arxiv/arxiv_train.py), Tokasaurus (https://github.com/ScalingIntelligence/tokasaurus), SGLang (https://docs.sglang.ai/start/install.html).
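The install and environment steps above can be sketched as a shell session. The repository URL, directory paths, and W&B names below are placeholders assumed for illustration, not values taken from the README.

```shell
# Hedged setup sketch; adjust paths and W&B names to your environment.
git clone https://github.com/HazyResearch/cartridges.git
cd cartridges
pip install uv            # install uv if not already available
uv pip install -e .       # the trailing "." installs the repo in editable mode

# Environment variables required by the synthesis/training scripts
export CARTRIDGES_DIR="$PWD"
export CARTRIDGES_OUTPUT_DIR="$PWD/outputs"
export CARTRIDGES_WANDB_PROJECT="my-cartridges-project"
export CARTRIDGES_WANDB_ENTITY="my-wandb-entity"
```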

Highlighted Details

  • Achieves up to 26x throughput improvement with maintained quality for long contexts.
  • "Self-study" trains compact KV caches via synthetic data generation and context distillation.
  • Supports diverse data sources including text files, JSON, LaTeX, Slack messages, and Gmail.
  • Recommends Tokasaurus for high-throughput inference serving.
  • Integrates with Modal for scalable, serverless inference workloads.
  • Provides loss-based (perplexity) and generation-based evaluations logged via WandB.
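The loss-based evaluation mentioned in the last bullet can be illustrated with the standard perplexity formula, exp of the mean per-token negative log-likelihood. This is a generic sketch of the metric, not the repository's evaluation code; the per-token NLLs are assumed to come from the model under test.

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# Example: three tokens with NLLs averaging 2.0 nats
nlls = [2.0, 1.5, 2.5]
print(round(perplexity(nlls), 3))  # exp(2.0) ~= 7.389
```

A cartridge that preserves generation quality should keep perplexity on held-out corpus questions close to that of the full-context model.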

Maintenance & Community

Compute resources for this project were provided by Modal, Together, Prime Intellect, Voltage Park, and Azure. No explicit community channels (e.g., Discord, Slack) are listed in the README. The roadmap and known issues are detailed in the "TODOs" section.

Licensing & Compatibility

The license type is indicated by a GitHub badge but not explicitly stated in the README text. No specific compatibility notes for commercial use or closed-source linking are provided.

Limitations & Caveats

Occasional NCCL collective operation timeouts during data parallel training may require setting distributed_backend="gloo". Trained cartridges are not yet uploadable to HuggingFace. Local chat functionality currently requires downloading cartridges from WandB, not directly from local files.
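The gloo workaround above amounts to a one-field config change. The surrounding config structure here is an assumption for illustration; only the distributed_backend="gloo" setting comes from the text.

```python
# Hedged sketch: falling back from NCCL to gloo to avoid collective
# timeouts during data-parallel training. The config shape is assumed.
train_config = dict(
    distributed_backend="gloo",  # workaround for NCCL collective timeouts
    data_parallel=True,
)
print(train_config["distributed_backend"])
```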

Health Check
Last Commit

2 weeks ago

Responsiveness

Inactive

Pull Requests (30d)
2
Issues (30d)
0
Star History
13 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Pawel Garbacki (cofounder of Fireworks AI), and 4 more.

LongLoRA by dvlab-research

3k stars
LongLoRA: Efficient fine-tuning for long-context LLMs
Created 2 years ago
Updated 1 year ago