FutureMLS-Lab
LLM KV cache optimization via 2-bit quantization
Summary
OSCAR addresses the significant memory overhead of KV caches in Large Language Models (LLMs) by introducing an offline, attention-aware 2-bit quantization technique. It targets engineers and researchers seeking to enable longer context windows and more efficient LLM deployment, offering substantial memory compression with minimal accuracy degradation.
How It Works
The core innovation lies in estimating attention-aware K/V covariance structures offline using a calibration set. OSCAR derives per-layer rotations and clipping thresholds that align KV quantization with the actual directions attention consumes. This approach utilizes INT2 storage for the bulk of the KV cache, augmented by a small BF16 sink and recent window, achieving approximately 7x memory reduction versus standard BF16.
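The rotate, clip, and quantize pipeline described above can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration: the function names are hypothetical, plain PCA of the calibration keys stands in for OSCAR's attention-aware covariance estimate, a percentile stands in for its clipping rule, and the token counts in the memory estimate are made up. It is not the project's implementation.

```python
import numpy as np


def calibrate(calib, clip_pct=99.0):
    """Offline step: derive a rotation and per-channel clipping thresholds.

    Stand-in only: uses PCA (eigenvectors of the calibration covariance),
    not OSCAR's attention-aware estimate.
    """
    cov = np.cov(calib, rowvar=False)            # (d, d) covariance of calibration keys
    _, rotation = np.linalg.eigh(cov)            # orthogonal matrix, columns are eigenvectors
    rotated = calib @ rotation
    clip = np.percentile(np.abs(rotated), clip_pct, axis=0)
    return rotation, np.maximum(clip, 1e-8)      # avoid zero-width quantization ranges


def quantize_int2(x, rotation, clip):
    """Rotate, clip, and map each value to one of 4 levels (codes 0..3)."""
    r = np.clip(x @ rotation, -clip, clip)
    step = 2.0 * clip / 3.0                      # 4 levels span [-clip, +clip]
    # Codes are left unpacked here; real storage would pack 4 codes per byte.
    return np.rint((r + clip) / step).astype(np.uint8)


def dequantize_int2(codes, rotation, clip):
    """Map codes back to level values, then undo the rotation."""
    step = 2.0 * clip / 3.0
    r = codes.astype(np.float64) * step - clip
    return r @ rotation.T


def compression_ratio(seq_len, bf16_tokens):
    """BF16 bits divided by mixed INT2 + BF16 bits, ignoring any metadata."""
    int2_tokens = seq_len - bf16_tokens
    return (16 * seq_len) / (16 * bf16_tokens + 2 * int2_tokens)


rng = np.random.default_rng(0)
calib_keys = rng.standard_normal((2048, 64))     # hypothetical calibration set
rotation, clip = calibrate(calib_keys)

keys = rng.standard_normal((16, 64))             # keys seen at inference time
codes = quantize_int2(keys, rotation, clip)
approx = dequantize_int2(codes, rotation, clip)

# With 160 of 8192 tokens kept in BF16 (sink + recent window, assumed sizes),
# the ratio is about 7.04, consistent with the roughly 7x figure above.
print(compression_ratio(seq_len=8192, bf16_tokens=160))
```

The memory estimate shows why the figure is about 7x rather than the 8x that pure INT2 would give: the small BF16 sink and recent window raise the average bits per element slightly above 2.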
Quick Start & Requirements
Installation: clone the repository with submodules (git clone --recursive), create a Conda environment (conda create -n oscar python=3.12 -y), activate it (conda activate oscar), and install (pip install -e sglang-research/python).
Highlighted Details
Maintenance & Community
The project is actively updated with new model support and testing for advanced use cases. No specific community channels (e.g., Discord, Slack) or formal roadmap links are provided in the README.
Licensing & Compatibility
Limitations & Caveats
Setup necessitates specific, high-end GPU hardware (multiple H100s) for efficient calibration and inference. Strict CUDA 12.8/12.9 dependency requires careful environment management. The integration relies on vendored code and compatibility shims, potentially complicating deep system-level debugging.