calm by zeux

Single-GPU inference engine for rapid LLM prototyping

Created 1 year ago
613 stars

Top 53.6% on SourcePulse

View on GitHub: https://github.com/zeux/calm
Project Summary

This project provides a minimal, dependency-light implementation for accelerating language model inference on CUDA and Metal GPUs. It targets researchers and power users seeking maximum single-GPU utilization for LLM architectures, enabling rapid prototyping and experimentation with various quantization formats and model variations.

How It Works

Calm maximizes hardware utilization by implementing LLM inference in C++ with minimal dependencies, using CUDA on NVIDIA GPUs and Metal on Apple Silicon. It supports several quantization formats (fp16, fp8, gf4) to balance performance and memory usage: fp8 offers roughly a 2x speedup over fp16, and gf4 adds roughly another 75% on top of fp8, at the cost of a small perplexity penalty. The implementation is designed to run at the memory-bandwidth limit, with optimizations such as efficient KV-cache access.
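
As a rough sanity check on those numbers (our own back-of-envelope, not taken from the README), the bandwidth-bound ceiling can be estimated by assuming the weights are read once per generated token and that an RTX 4090 sustains roughly 1 TB/s of memory bandwidth; gf4 is treated here as ~4.5 bits per weight to account for per-group scales:

  # Bandwidth-bound decoding ceiling (sketch; assumptions are ours, not calm's).
  # Weights are read once per token, KV-cache traffic is ignored, and the
  # RTX 4090 is assumed to sustain ~1000 GB/s of memory bandwidth.
  PARAMS = 7e9          # 7B-parameter model
  BANDWIDTH_GBS = 1000  # approximate RTX 4090 memory bandwidth, GB/s

  for name, bits in [("fp16", 16), ("fp8", 8), ("gf4", 4.5)]:
      weights_gb = PARAMS * bits / 8 / 1e9
      print(f"{name}: ~{weights_gb:.1f} GB of weights -> ceiling ~{BANDWIDTH_GBS / weights_gb:.0f} tok/s")

This works out to roughly 71 tok/s for fp16, 143 tok/s for fp8 (~2x), and 254 tok/s for gf4 (~1.75x over fp8), consistent with the 246 tok/s reported below for Llama2 7B (gf4) on an RTX 4090.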

Quick Start & Requirements

  • Build: make
  • Run: ./build/run <model_path> -i "<prompt>"
  • Model Conversion: python tools/convert.py <output_path> <model_dir> (an end-to-end flow is sketched after this list)
  • Dependencies: pip install -r tools/requirements.txt (for conversion scripts), git-lfs (for model downloads).
  • Hardware: NVIDIA GPU (CUDA) or Apple Silicon (Metal). Linux is the primary supported OS; macOS has experimental Metal support.
  • Docs: https://github.com/zeux/calm
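
Putting those steps together, a minimal driver might look like the sketch below. The model directory, output path, and prompt are hypothetical placeholders; only the calm commands themselves come from the quick-start list:

  # Sketch of the documented flow: build, convert a locally downloaded
  # (git-lfs) checkpoint into calm's format, then run single-GPU inference.
  # MODEL_DIR, OUTPUT, and PROMPT are hypothetical placeholders.
  import subprocess

  MODEL_DIR = "Mistral-7B-v0.1"   # local model checkout (downloaded via git-lfs)
  OUTPUT = "mistral-7b.calm"      # converted model path; see the repo for the expected format
  PROMPT = "Why is the sky blue?"

  subprocess.run(["make"], check=True)                                            # build ./build/run
  subprocess.run(["python", "tools/convert.py", OUTPUT, MODEL_DIR], check=True)   # convert weights
  subprocess.run(["./build/run", OUTPUT, "-i", PROMPT], check=True)               # generate from the prompt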

Highlighted Details

  • Supports a wide range of LLM architectures including Llama, Mistral, Mixtral, Qwen2, Gemma, and Phi3.
  • Achieves high throughput, e.g., 246 tok/s on Llama2 7B (gf4) and 225 tok/s on Mistral 7B (gf4) on an RTX 4090.
  • Offers efficient inference on Apple Silicon, with up to 73 tok/s on Llama3 8B (gf4) on an M1 Max.
  • Optimized for fp8 and gf4 quantization, significantly reducing memory footprint and increasing speed.

Maintenance & Community

The project is maintained by zeux. There are no explicit community links (Discord/Slack) or roadmaps provided in the README.

Licensing & Compatibility

The README does not explicitly state a license; check the repository for a LICENSE file before relying on the code. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The project is explicitly not production-ready or stable; it prioritizes experimentation and prototyping. Prompt processing is currently serial, which can become a bottleneck for long prompts. Performance on high-end Apple Silicon chips may fall short of peak due to profiling limitations, and Gemma 7B has reported issues with fp8 quantization.

Health Check

  • Last Commit: 3 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 9 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Jeremy Howard (Cofounder of fast.ai).

GPTFast by MDK8888

0%
687
HF Transformers accelerator for faster inference
Created 1 year ago
Updated 1 year ago
Starred by Lysandre Debut (Chief Open-Source Officer at Hugging Face), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 4 more.

AQLM by Vahe1994

0.4%
1k
PyTorch code for LLM compression via Additive Quantization (AQLM)
Created 1 year ago
Updated 1 month ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Ying Sheng (Coauthor of SGLang).

fastllm by ztxz16

0.4%
4k
High-performance C++ LLM inference library
Created 2 years ago
Updated 1 week ago
Starred by Wing Lian (Founder of Axolotl AI) and Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems").

airllm by lyogavin

0.1%
6k
Inference optimization for LLMs on low-resource hardware
Created 2 years ago
Updated 2 weeks ago