Single-GPU inference engine for rapid LLM prototyping
This project provides a minimal, dependency-light implementation for accelerating language model inference on CUDA and Metal GPUs. It targets researchers and power users seeking maximum single-GPU utilization for LLM architectures, enabling rapid prototyping and experimentation with various quantization formats and model variations.
How It Works
Calm maximizes hardware utilization by implementing LLM inference in C++ with minimal dependencies, using CUDA on NVIDIA GPUs and Metal on Apple Silicon. It supports several quantization formats (fp16, fp8, gf4) to trade off speed, memory use, and accuracy: fp8 offers a ~2x speedup and gf4 a ~75% speedup over fp16, at the cost of a small perplexity penalty. Token-by-token decoding is memory-bandwidth-bound, so the implementation aims to keep weight and KV cache reads close to the hardware's bandwidth limit, with optimizations for efficient KV cache access.
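To see why the quantization format maps so directly onto decoding speed, the back-of-the-envelope sketch below estimates the bandwidth-bound token-rate ceiling: every generated token streams the full set of weights plus the KV cache from memory, so tokens/s is roughly bandwidth divided by bytes read per token. This is an illustrative sketch, not part of calm; the bytes-per-weight figures (especially the gf4 overhead) and the bandwidth number are assumptions.

```python
# Rough, illustrative estimate of the bandwidth-bound decoding ceiling.
# All constants are assumptions for illustration, not calm's measurements.

BYTES_PER_WEIGHT = {
    "fp16": 2.0,
    "fp8": 1.0,
    "gf4": 0.53,  # ~4 bits per weight plus assumed per-group scale overhead
}

def tokens_per_second_ceiling(
    n_params: float,            # model parameters, e.g. 7e9 for a 7B model
    fmt: str,                   # "fp16" | "fp8" | "gf4"
    bandwidth_gbps: float,      # achievable memory bandwidth in GB/s (assumed)
    kv_cache_bytes: float = 0,  # KV cache bytes read per token
) -> float:
    """Decoding is bandwidth-bound: each token reads all weights + KV cache."""
    bytes_per_token = n_params * BYTES_PER_WEIGHT[fmt] + kv_cache_bytes
    return bandwidth_gbps * 1e9 / bytes_per_token

if __name__ == "__main__":
    for fmt in ("fp16", "fp8", "gf4"):
        rate = tokens_per_second_ceiling(7e9, fmt, bandwidth_gbps=800)
        print(f"{fmt}: ~{rate:.0f} tok/s ceiling")
```

Measured speedups (such as gf4's ~75% over fp16) can fall short of the raw byte-count ratio because of dequantization overhead and KV cache traffic, which this ceiling ignores.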
Quick Start & Requirements
Build and run:
make
./build/run <model_path> -i "<prompt>"
Convert a model to calm's format:
python tools/convert.py <output_path> <model_dir>
Requirements: pip install -r tools/requirements.txt (for the conversion scripts) and git-lfs (for model downloads).
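The sketch below strings these steps together in a plausible order (download, install conversion dependencies, convert, build, run). It is only an illustration: the Hugging Face repository URL, file names, and any extra flags convert.py may need are placeholders, not taken from the project's documentation.

```python
# Hypothetical end-to-end driver for the quick-start commands above.
# MODEL_REPO and the paths are placeholders; adjust for the model you use.
import subprocess

MODEL_REPO = "https://huggingface.co/<org>/<model>"  # cloning requires git-lfs
MODEL_DIR = "model_src"
CALM_MODEL = "model.calm"

def sh(*cmd: str) -> None:
    """Run a command, echoing it first and failing loudly on errors."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

sh("git", "clone", MODEL_REPO, MODEL_DIR)                 # download weights
sh("pip", "install", "-r", "tools/requirements.txt")      # conversion dependencies
sh("python", "tools/convert.py", CALM_MODEL, MODEL_DIR)   # convert to calm format
sh("make")                                                # build the engine
sh("./build/run", CALM_MODEL, "-i", "Hello")              # run inference on a prompt
```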
Maintenance & Community
The project is maintained by zeux. There are no explicit community links (Discord/Slack) or roadmaps provided in the README.
Licensing & Compatibility
The README does not state a license. Users should check the repository for license terms before relying on the project; compatibility with commercial use or closed-source linking is not specified.
Limitations & Caveats
The project is explicitly stated as not production-ready or stable; it prioritizes experimentation and prototyping. Prompt processing is currently serial, which can become a bottleneck for longer prompts. Performance on high-end Apple Silicon chips may fall short of peak due to profiling limitations. Gemma 7B has reported issues with fp8 quantization.