cartridges by HazyResearch

Lightweight long context representation for LLMs

Created 10 months ago
258 stars

Top 98.0% on SourcePulse

View on GitHub
Project Summary

Summary

Cartridges addresses the high cost of processing long contexts in Large Language Models (LLMs) by introducing a novel method for creating compact Key-Value (KV) caches. Targeting researchers and engineers working with LLMs, it enables significant throughput gains (up to 26x) while preserving generation quality, making long-context applications more efficient and cost-effective.

How It Works

The core innovation is "self-study," a test-time training recipe that distills a large corpus into a small, efficient KV cache, termed a "cartridge." This process involves generating synthetic conversational data about the corpus using AI agents (one asking questions, another answering) and then training the cartridge via context distillation. This approach drastically reduces KV cache size, directly translating to higher throughput during inference.
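The self-study loop described above can be sketched as follows. This is a minimal illustration, not the repository's implementation: the ask and answer functions are hypothetical stand-ins for the two LLM agents, and the context-distillation training step is omitted.

```python
# Sketch of the "self-study" synthetic-data loop. In the real recipe,
# ask() and answer() are LLM agents; here they are string stubs.

def ask(chunk, turn):
    # Hypothetical question agent: generates a question grounded in the chunk.
    return f"Q{turn}: What does this passage say about '{chunk[:20]}'?"

def answer(chunk, question):
    # Hypothetical answer agent: answers using the full chunk as context.
    return f"Based on the passage: {chunk}"

def synthesize(corpus, n_turns=3):
    """Generate synthetic conversational data about a corpus.

    The resulting conversations would then be used to train a compact
    KV cache (a "cartridge") via context distillation.
    """
    conversations = []
    for i in range(n_turns):
        chunk = corpus[i % len(corpus)]
        q = ask(chunk, i)
        a = answer(chunk, q)
        conversations.append({"question": q, "answer": a})
    return conversations

corpus = [
    "The cartridge compresses the KV cache for a long corpus.",
    "Self-study distills the corpus at test time.",
]
data = synthesize(corpus)
print(len(data))  # 3 synthetic QA turns
```

Each synthetic turn pairs a question about the corpus with a grounded answer; training on many such turns is what lets the small cartridge stand in for the full long context.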

Quick Start & Requirements

  • Primary install: clone the repository, install uv, then run "uv pip install -e ." (the trailing dot is the install target, i.e. the repository root).
  • Environment variables: Requires setting CARTRIDGES_DIR, CARTRIDGES_OUTPUT_DIR, CARTRIDGES_WANDB_PROJECT, and CARTRIDGES_WANDB_ENTITY.
  • Prerequisites: Python, uv, wandb, an inference server (Tokasaurus or SGLang), and GPU access. Modal is recommended for scalable inference workloads.
  • Links: Paper (arXiv:2506.06266), Synthesis example (examples/arxiv/arxiv_synthesize.py), Training example (examples/arxiv/arxiv_train.py), Tokasaurus (https://github.com/ScalingIntelligence/tokasaurus), SGLang (https://docs.sglang.ai/start/install.html).
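The install and environment steps above can be sketched as a shell session. The repository URL, directory paths, and W&B names below are placeholders assumed for illustration, not values taken from the README.

```shell
# Hedged setup sketch; adjust paths and W&B names to your environment.
git clone https://github.com/HazyResearch/cartridges.git
cd cartridges
pip install uv            # install uv if not already available
uv pip install -e .       # the trailing "." installs the repo in editable mode

# Environment variables required by the synthesis/training scripts
export CARTRIDGES_DIR="$PWD"
export CARTRIDGES_OUTPUT_DIR="$PWD/outputs"
export CARTRIDGES_WANDB_PROJECT="my-cartridges-project"
export CARTRIDGES_WANDB_ENTITY="my-wandb-entity"
```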

Highlighted Details

  • Achieves up to 26x throughput improvement with maintained quality for long contexts.
  • "Self-study" trains compact KV caches via synthetic data generation and context distillation.
  • Supports diverse data sources including text files, JSON, LaTeX, Slack messages, and Gmail.
  • Recommends Tokasaurus for high-throughput inference serving.
  • Integrates with Modal for scalable, serverless inference workloads.
  • Provides loss-based (perplexity) and generation-based evaluations logged via WandB.
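The loss-based evaluation mentioned in the last bullet can be illustrated with the standard perplexity formula, exp of the mean per-token negative log-likelihood. This is a generic sketch of the metric, not the repository's evaluation code; the per-token NLLs are assumed to come from the model under test.

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# Example: three tokens with NLLs averaging 2.0 nats
nlls = [2.0, 1.5, 2.5]
print(round(perplexity(nlls), 3))  # exp(2.0) ~= 7.389
```

A cartridge that preserves generation quality should keep perplexity on held-out corpus questions close to that of the full-context model.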

Maintenance & Community

Compute resources for this project were provided by Modal, Together, Prime Intellect, Voltage Park, and Azure. No explicit community channels (e.g., Discord, Slack) are listed in the README. The roadmap and known issues are detailed in the "TODOs" section.

Licensing & Compatibility

The license type is indicated by a GitHub badge but not explicitly stated in the README text. No specific compatibility notes for commercial use or closed-source linking are provided.

Limitations & Caveats

Occasional NCCL collective operation timeouts during data parallel training may require setting distributed_backend="gloo". Trained cartridges are not yet uploadable to HuggingFace. Local chat functionality currently requires downloading cartridges from WandB, not directly from local files.
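The gloo workaround above amounts to a one-field config change. The surrounding config structure here is an assumption for illustration; only the distributed_backend="gloo" setting comes from the text.

```python
# Hedged sketch: falling back from NCCL to gloo to avoid collective
# timeouts during data-parallel training. The config shape is assumed.
train_config = dict(
    distributed_backend="gloo",  # workaround for NCCL collective timeouts
    data_parallel=True,
)
print(train_config["distributed_backend"])
```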

Health Check
Last Commit

2 weeks ago

Responsiveness

Inactive

Pull Requests (30d)
2
Issues (30d)
0
Star History
13 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Pawel Garbacki (cofounder of Fireworks AI), and 4 more.

LongLoRA by dvlab-research

3k stars
LongLoRA: Efficient fine-tuning for long-context LLMs
Created 2 years ago
Updated 1 year ago