TurboQuant+: LLM KV cache compression for efficient local inference
Top 53.4% on SourcePulse
TurboQuant+ addresses the significant memory overhead of LLM KV caches for local inference. It offers advanced compression techniques, enabling larger models and longer contexts on consumer hardware with minimal performance degradation. The project targets engineers and researchers seeking efficient LLM deployment.
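To see why KV-cache memory dominates local inference, a back-of-envelope calculation helps. The model dimensions below are illustrative assumptions (roughly Llama-7B-class), not figures from the project; only the ~4.6x ratio comes from the text.

```python
# Illustrative KV-cache sizing; dimensions are assumptions for the example.
n_layers = 32          # transformer layers
n_kv_heads = 32        # KV attention heads
head_dim = 128         # dimension per head
seq_len = 32_768       # context length in tokens
bytes_fp16 = 2         # bytes per element at fp16

# K and V each store seq_len x n_kv_heads x head_dim values per layer,
# hence the leading factor of 2.
kv_bytes = 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_fp16
print(f"fp16 KV cache: {kv_bytes / 2**30:.1f} GiB")

# At the project's reported ~4.6x compression ratio:
print(f"compressed:    {kv_bytes / 4.6 / 2**30:.1f} GiB")
```

At fp16 this hypothetical cache alone is 16 GiB, which is why compression is the difference between fitting and not fitting a long context on consumer hardware.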
How It Works
The core approach compresses transformer KV caches in two stages: PolarQuant (b-1 bits), which applies a random rotation followed by scalar quantization, and QJL (1 bit), which provides an unbiased inner-product correction. Together these yield a CompressedVector representation. The method achieves compression ratios of up to 4.6x while maintaining a near-zero speed penalty and high fidelity compared to uncompressed or standard quantized caches.
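The two-stage pipeline can be sketched in NumPy. The stage names follow the text, but this toy implementation is an assumption-laden illustration (fixed rotation, max-scaled magnitude codes, separate sign bits), not the project's actual code:

```python
import numpy as np

rng = np.random.default_rng(0)
d, b = 64, 4  # head dimension and total bits per entry (assumed values)

# Fixed random rotation, shared by compressor and decompressor
# (QR of a Gaussian matrix gives an orthogonal factor).
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))

def compress(v):
    r = Q @ v                         # stage 1: random rotation
    signs = r >= 0                    # 1-bit sign stream (QJL-style)
    mags = np.abs(r)
    scale = mags.max() / (2**(b - 1) - 1)
    codes = np.round(mags / scale).astype(np.uint8)  # (b-1)-bit magnitudes
    return signs, codes, scale        # toy CompressedVector

def decompress(signs, codes, scale):
    r = np.where(signs, 1.0, -1.0) * codes * scale
    return Q.T @ r                    # undo the rotation

v = rng.standard_normal(d)
v_hat = decompress(*compress(v))
err = np.linalg.norm(v - v_hat) / np.linalg.norm(v)
print(f"relative reconstruction error: {err:.3f}")
```

The rotation spreads energy evenly across coordinates before scalar quantization, which is what lets a few magnitude bits plus a sign bit preserve inner products well.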
Quick Start & Requirements
- Install: pip install -e ".[dev]"
- Run tests: python3 -m pytest tests/ -v
- llama.cpp integration: use the llama-cpp-turboquant fork; build with Metal (Apple Silicon) or CUDA (NVIDIA) using CMake.
- Metal build: cmake -B build -DGGML_METAL=ON -DGGML_METAL_EMBED_LIBRARY=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j
- CUDA build (untested): cmake -B build -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j
- Python dependencies: torch, transformers, accelerate
- Demos: benchmarks/demo.py, benchmarks/validate_real_model.py
Highlighted Details
- The compressed KV cache (turbo3 type) reaches q8_0 speed parity on Apple Silicon (M5 Max: 2747 vs 2694 tok/s prefill).
Maintenance & Community
The project is v1 complete, speed-optimized, and community-tested with 511+ Python tests and 100% code coverage. A C port is integrated into a llama.cpp fork with Metal GPU kernels. Over 10 testers have contributed across Mac and NVIDIA hardware. The roadmap indicates ongoing work on CUDA backend, benchmark hardening, and advanced extensions.
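Since the C port lives in a llama.cpp fork, selecting the compressed cache at runtime would plausibly use llama.cpp's standard cache-type flags. This invocation is a hypothetical sketch: --cache-type-k/--cache-type-v are real llama.cpp options, but registering "turbo3" as a cache type is an assumption about the fork, and the model path is a placeholder.

```shell
# Hypothetical: run the fork's CLI with the compressed KV cache.
# "turbo3" as a cache-type value is assumed to be added by the fork;
# model.gguf is a placeholder path.
./build/bin/llama-cli -m model.gguf \
  --cache-type-k turbo3 --cache-type-v turbo3 \
  -p "Hello" -n 64
```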
Licensing & Compatibility
Licensed under the Apache License 2.0, permitting commercial use and integration into closed-source projects.
Limitations & Caveats
The CUDA backend is still under development. Advanced "Plus" extensions like adaptive bit allocation and temporal decay are in experimental branches or planned. Integration requires using a specific llama.cpp fork, and upstream coordination is ongoing. The turbo4 variant is noted as broken and requires updates.