OSCAR by FutureMLS-Lab

LLM KV cache optimization via 2-bit quantization

Created 2 months ago

550 stars

Top 57.3% on SourcePulse

Project Summary


OSCAR addresses the significant memory overhead of KV caches in Large Language Models (LLMs) by introducing an offline, attention-aware 2-bit quantization technique. It targets engineers and researchers seeking to enable longer context windows and more efficient LLM deployment, offering substantial memory compression with minimal accuracy degradation.

How It Works

The core innovation is estimating attention-aware K/V covariance structures offline from a calibration set. From these, OSCAR derives per-layer rotations and clipping thresholds that align KV quantization with the directions attention actually consumes. The bulk of the KV cache is stored in INT2, while a small sink and a recent window stay in BF16, giving approximately 7x memory reduction versus a standard BF16 cache.
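The rotate, clip, and quantize pipeline described above can be illustrated with a small NumPy sketch. This is not OSCAR's implementation: the function names, the use of the query second moment as the "attention-aware" statistic, the percentile-based clipping rule, and the uniform 4-level quantizer are all assumptions chosen for illustration. The sketch shows only the general idea that an orthogonal rotation leaves attention scores unchanged and that the rotated keys can then be stored as packed 2-bit codes.

```python
import numpy as np

def calibrate_rotation(queries: np.ndarray) -> np.ndarray:
    """Orthogonal matrix from the eigenvectors of the query second moment.

    Assumption: attention scores are q . k, so the query statistics indicate
    which key directions matter. OSCAR's actual estimator may differ.
    """
    second_moment = queries.T @ queries / len(queries)
    _, eigvecs = np.linalg.eigh(second_moment)  # symmetric input, orthonormal columns
    return eigvecs

def calibrate_clip(rotated: np.ndarray, percentile: float = 99.0) -> float:
    """Hypothetical clipping rule: a percentile of the absolute values."""
    return float(np.percentile(np.abs(rotated), percentile))

def quantize_int2(x: np.ndarray, clip: float) -> np.ndarray:
    """Map values in [-clip, clip] onto the four codes 0..3."""
    step = 2.0 * clip / 3.0
    return np.round((np.clip(x, -clip, clip) + clip) / step).astype(np.uint8)

def dequantize_int2(codes: np.ndarray, clip: float) -> np.ndarray:
    step = 2.0 * clip / 3.0
    return codes.astype(np.float32) * step - clip

def pack_int2(codes: np.ndarray) -> np.ndarray:
    """Pack four 2-bit codes per byte (last axis must be divisible by 4)."""
    c = codes.reshape(*codes.shape[:-1], -1, 4)
    return (c[..., 0] | (c[..., 1] << 2) | (c[..., 2] << 4) | (c[..., 3] << 6)).astype(np.uint8)

def unpack_int2(packed: np.ndarray) -> np.ndarray:
    parts = [(packed >> shift) & 3 for shift in (0, 2, 4, 6)]
    return np.stack(parts, axis=-1).reshape(*packed.shape[:-1], -1)

# Toy calibration data standing in for one attention head.
rng = np.random.default_rng(0)
head_dim = 64
queries = rng.standard_normal((512, head_dim)).astype(np.float32)
keys = rng.standard_normal((128, head_dim)).astype(np.float32)

R = calibrate_rotation(queries)
rotated_keys = keys @ R
clip = calibrate_clip(rotated_keys)

codes = quantize_int2(rotated_keys, clip)
packed = pack_int2(codes)                      # 4 values per byte
keys_hat = dequantize_int2(unpack_int2(packed), clip)

# Rotating queries and keys by the same orthogonal R preserves q . k exactly
# (up to float error); only the quantization step introduces error.
scores_exact = queries[:4] @ keys.T
scores_rotated = (queries[:4] @ R) @ rotated_keys.T
scores_quantized = (queries[:4] @ R) @ keys_hat.T
```

Because the rotation is orthogonal, it can be computed once offline and applied to queries at inference time, which is consistent with the offline calibration described above.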

Quick Start & Requirements

Primary install/run: Clone repo (git clone --recursive), create Conda environment (conda create -n oscar python=3.12 -y), activate (conda activate oscar), install (pip install -e sglang-research/python).
Prerequisites: CUDA 12.8/12.9 (nvcc on $PATH), Python 3.12, Conda. Hardware: H100 80 GB (4B/8B models), 4x H100 (32B / MiniMax-M2.7), 8x H100 (GLM-4.7-FP8). HuggingFace access required.
Setup time: ~20 minutes end-to-end for Qwen3-8B on a single H100.
Links: Code, Paper.
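Since the prerequisites are strict, a quick preflight check before installing can save time. The script below is a hypothetical helper, not part of the OSCAR repository; it only checks the items listed above (Python 3.12, Conda, and an nvcc on $PATH reporting CUDA 12.8 or 12.9) and assumes nvcc prints its version as "release X.Y".

```python
import re
import shutil
import subprocess
import sys

SUPPORTED_CUDA = {(12, 8), (12, 9)}   # versions listed in the prerequisites
REQUIRED_PYTHON = (3, 12)

def parse_nvcc_release(version_text):
    """Extract (major, minor) from `nvcc --version` output, or None if absent."""
    match = re.search(r"release (\d+)\.(\d+)", version_text)
    return (int(match.group(1)), int(match.group(2))) if match else None

def check_environment():
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if sys.version_info[:2] != REQUIRED_PYTHON:
        found = ".".join(map(str, sys.version_info[:2]))
        problems.append(f"Python {found} found; the listed prerequisite is 3.12")
    if shutil.which("conda") is None:
        problems.append("conda not found on PATH")
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        problems.append("nvcc not found on PATH")
    else:
        output = subprocess.run([nvcc, "--version"], capture_output=True, text=True).stdout
        release = parse_nvcc_release(output)
        if release not in SUPPORTED_CUDA:
            problems.append(f"nvcc reports CUDA {release}; expected 12.8 or 12.9")
    return problems

if __name__ == "__main__":
    issues = check_environment()
    for issue in issues:
        print("WARN:", issue)
    if not issues:
        print("Environment matches the listed prerequisites.")
```

The check does not cover GPU model, GPU count, or HuggingFace access, which still need to be verified separately.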

Highlighted Details

Achieves ~7x KV-cache memory compression with a single-digit percentage-point (pp) accuracy drop on benchmarks such as GPQA.
Outperforms prior INT2 KV-cache quantization methods (e.g., QuaRot, KIVI) in accuracy retention, often matching or exceeding BF16 baselines across reasoning, coding, and long-context tasks.
Supports a growing list of models including Qwen3.x, MiniMax-M2.7, and GLM-4.7, with ongoing development for larger contexts and agentic applications.
Offers a pre-computed "RotationZoo" for direct download, bypassing the calibration step.
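The ~7x compression figure can be sanity-checked with simple arithmetic: pure INT2 versus BF16 would give 16/2 = 8x, and the BF16 sink and recent window plus any quantization metadata pull the ratio down. The helper below is a back-of-the-envelope sketch; the sink size, window size, and per-value metadata overhead are hypothetical values chosen for illustration, not numbers taken from the project.

```python
def kv_compression_ratio(seq_len, sink_tokens, recent_tokens, overhead_bits_per_value=0.0):
    """Ratio of a full BF16 KV cache to a mixed INT2/BF16 cache, per cached value.

    overhead_bits_per_value is an assumed amortized cost of quantization
    metadata (for example, scales shared across a group of values).
    """
    high_precision = min(seq_len, sink_tokens + recent_tokens)  # tokens kept in BF16
    low_precision = seq_len - high_precision                    # tokens stored in INT2
    baseline_bits = 16.0 * seq_len
    mixed_bits = 16.0 * high_precision + (2.0 + overhead_bits_per_value) * low_precision
    return baseline_bits / mixed_bits

# Upper bound: everything in INT2, no metadata.
ideal = kv_compression_ratio(seq_len=1000, sink_tokens=0, recent_tokens=0)

# Hypothetical configuration: 32k context, 4 sink tokens, 128 recent tokens,
# and 0.25 bits of metadata per value (e.g. 32 bits shared by 128 values).
realistic = kv_compression_ratio(
    seq_len=32768, sink_tokens=4, recent_tokens=128, overhead_bits_per_value=0.25
)
print(f"ideal: {ideal:.2f}x, with assumed overheads: {realistic:.2f}x")
```

Under these assumed values the ratio lands a little below 7x, and it approaches the 8x bound as the context grows relative to the BF16 window. For short contexts the BF16 portion dominates and the saving shrinks accordingly.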

Maintenance & Community

The project is actively updated with new model support and testing for advanced use cases. No specific community channels (e.g., Discord, Slack) or formal roadmap links are provided in the README.

Licensing & Compatibility

License: MIT License.
Compatibility: Permissive MIT license allows for commercial use and integration into closed-source projects. Built upon the SGLang framework.

Limitations & Caveats

Setup requires high-end GPU hardware: a single H100 80 GB for 4B/8B models, and 4 to 8 H100s for the larger supported models. The strict CUDA 12.8/12.9 dependency requires careful environment management. The integration relies on vendored code and compatibility shims, which may complicate deep system-level debugging.

Health Check

Last Commit

2 days ago

Responsiveness

Inactive

Star History

13 stars in the last 30 days