calm by zeux

Single-GPU inference engine for rapid LLM prototyping

created 1 year ago · 599 stars · Top 55.3% on sourcepulse

View on GitHub
Project Summary

This project provides a minimal, dependency-light implementation for accelerating language model inference on CUDA and Metal GPUs. It targets researchers and power users seeking maximum single-GPU utilization for LLM architectures, enabling rapid prototyping and experimentation with various quantization formats and model variations.

How It Works

Calm maximizes hardware utilization by implementing LLM inference in C++ with minimal dependencies, using CUDA on NVIDIA GPUs and Metal on Apple Silicon. It supports several quantization formats (fp16, fp8, gf4) to balance performance against memory usage: fp8 offers a ~2x speedup over fp16, and gf4 a further ~75% speedup over fp8, each at the cost of a small perplexity penalty. Token generation is bandwidth-bound, so the implementation is optimized around memory traffic, including efficient KV cache access.
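
As a back-of-envelope illustration of what bandwidth-bound means, consider the sketch below (the bandwidth and bytes-per-weight constants are rough assumptions for illustration, not figures from the README): generating one token streams the entire weight set from GPU memory, so throughput is capped at roughly memory bandwidth divided by total weight bytes.

  # Rough roofline estimate for bandwidth-bound token generation.
  # All constants are illustrative assumptions, not measured values.
  params = 7e9               # ~7B-parameter model
  bandwidth = 1.0e12         # RTX 4090 memory bandwidth, roughly 1 TB/s

  # Approximate effective bytes per weight; gf4 is assumed to carry
  # per-group scale overhead on top of its 4 bits per weight.
  bytes_per_weight = {"fp16": 2.0, "fp8": 1.0, "gf4": 0.625}

  for fmt, bpw in bytes_per_weight.items():
      # Each token streams the full weight set once; KV cache traffic
      # is ignored, so these are optimistic ceilings.
      print(f"{fmt}: ~{bandwidth / (params * bpw):.0f} tok/s ceiling")

This predicts ceilings of roughly 70 (fp16), 140 (fp8), and 230 (gf4) tok/s; the measured gf4 figures quoted below sit near that ceiling, which is the practical meaning of bandwidth-bound.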

Quick Start & Requirements

  • Build: make
  • Run: ./build/run <model_path> -i "<prompt>"
  • Model Conversion: python tools/convert.py <output_path> <model_dir>
  • Dependencies: pip install -r tools/requirements.txt (for conversion scripts), git-lfs (for model downloads).
  • Hardware: NVIDIA GPU (CUDA) or Apple Silicon (Metal). Linux is the primary supported OS; macOS has experimental Metal support.
  • Docs: https://github.com/zeux/calm
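
Putting the steps above together, a typical first session might look like this (the model directory name and prompt are illustrative placeholders; this assumes a supported model has already been downloaded, e.g. via git-lfs from Hugging Face):

  make
  pip install -r tools/requirements.txt
  python tools/convert.py mistral-7b-instruct.calm Mistral-7B-Instruct-v0.2/
  ./build/run mistral-7b-instruct.calm -i "Explain KV caching in one paragraph."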

Highlighted Details

  • Supports a wide range of LLM architectures including Llama, Mistral, Mixtral, Qwen2, Gemma, and Phi3.
  • Achieves high throughput, e.g., 246 tok/s on Llama2 7B (gf4) and 225 tok/s on Mistral 7B (gf4) on an RTX 4090.
  • Offers efficient inference on Apple Silicon, with up to 73 tok/s on Llama3 8B (gf4) on an M1 Max.
  • Optimized for fp8 and gf4 quantization, significantly reducing memory footprint and increasing speed.

Maintenance & Community

The project is maintained by zeux. The README provides no community links (Discord/Slack) and no roadmap.

Licensing & Compatibility

The README does not explicitly state a license, so users should check the repository for a license file before relying on the project. Compatibility with commercial use or closed-source linking is likewise unspecified.

Limitations & Caveats

The project is explicitly not production-ready or stable; it prioritizes experimentation and prototyping. Prompt processing is currently serial, which can become a bottleneck for long prompts. Performance on high-end Apple Silicon chips may fall short of peak potential due to profiling limitations, and Gemma 7B has reported issues with fp8 quantization.

Health Check

  • Last commit: 2 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 53 stars in the last 90 days

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Jaret Burkett (Founder of Ostris), and 1 more.

Explore Similar Projects

nunchaku by nunchaku-tech

High-performance 4-bit diffusion model inference engine
3k stars · Top 2.1% · created 8 months ago · updated 10 hours ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems) and Ying Sheng (Author of SGLang).

fastllm by ztxz16

High-performance C++ LLM inference library
4k stars · Top 0.4% · created 2 years ago · updated 2 weeks ago