calm by zeux

Single-GPU inference engine for rapid LLM prototyping

Created 1 year ago
613 stars

Top 53.6% on SourcePulse

View on GitHub: https://github.com/zeux/calm
Project Summary

This project provides a minimal, dependency-light implementation for accelerating language model inference on CUDA and Metal GPUs. It targets researchers and power users seeking maximum single-GPU utilization for LLM architectures, enabling rapid prototyping and experimentation with various quantization formats and model variations.

How It Works

Calm maximizes hardware utilization by implementing LLM inference in C++ with minimal dependencies, using CUDA on NVIDIA GPUs and Metal on Apple Silicon. It supports several quantization formats (fp16, fp8, gf4) to balance performance and memory usage: fp8 offers roughly a 2x speedup over fp16, and gf4 adds roughly another 75% on top of fp8, at the cost of a small perplexity penalty. The implementation is designed to run at the memory-bandwidth limit, with optimizations such as efficient KV-cache access.
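
As a rough sanity check on those numbers (our own back-of-envelope, not taken from the README), the bandwidth-bound ceiling can be estimated by assuming the weights are read once per generated token and that an RTX 4090 sustains roughly 1 TB/s of memory bandwidth; gf4 is treated here as ~4.5 bits per weight to account for per-group scales:

  # Bandwidth-bound decoding ceiling (sketch; assumptions are ours, not calm's).
  # Weights are read once per token, KV-cache traffic is ignored, and the
  # RTX 4090 is assumed to sustain ~1000 GB/s of memory bandwidth.
  PARAMS = 7e9          # 7B-parameter model
  BANDWIDTH_GBS = 1000  # approximate RTX 4090 memory bandwidth, GB/s

  for name, bits in [("fp16", 16), ("fp8", 8), ("gf4", 4.5)]:
      weights_gb = PARAMS * bits / 8 / 1e9
      print(f"{name}: ~{weights_gb:.1f} GB of weights -> ceiling ~{BANDWIDTH_GBS / weights_gb:.0f} tok/s")

This works out to roughly 71 tok/s for fp16, 143 tok/s for fp8 (~2x), and 254 tok/s for gf4 (~1.75x over fp8), consistent with the 246 tok/s reported below for Llama2 7B (gf4) on an RTX 4090.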

Quick Start & Requirements

  • Build: make
  • Run: ./build/run <model_path> -i "<prompt>"
  • Model Conversion: python tools/convert.py <output_path> <model_dir> (an end-to-end flow is sketched after this list)
  • Dependencies: pip install -r tools/requirements.txt (for conversion scripts), git-lfs (for model downloads).
  • Hardware: NVIDIA GPU (CUDA) or Apple Silicon (Metal). Linux is the primary supported OS; macOS has experimental Metal support.
  • Docs: https://github.com/zeux/calm
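
Putting those steps together, a minimal driver might look like the sketch below. The model directory, output path, and prompt are hypothetical placeholders; only the calm commands themselves come from the quick-start list:

  # Sketch of the documented flow: build, convert a locally downloaded
  # (git-lfs) checkpoint into calm's format, then run single-GPU inference.
  # MODEL_DIR, OUTPUT, and PROMPT are hypothetical placeholders.
  import subprocess

  MODEL_DIR = "Mistral-7B-v0.1"   # local model checkout (downloaded via git-lfs)
  OUTPUT = "mistral-7b.calm"      # converted model path; see the repo for the expected format
  PROMPT = "Why is the sky blue?"

  subprocess.run(["make"], check=True)                                            # build ./build/run
  subprocess.run(["python", "tools/convert.py", OUTPUT, MODEL_DIR], check=True)   # convert weights
  subprocess.run(["./build/run", OUTPUT, "-i", PROMPT], check=True)               # generate from the prompt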

Highlighted Details

  • Supports a wide range of LLM architectures including Llama, Mistral, Mixtral, Qwen2, Gemma, and Phi3.
  • Achieves high throughput, e.g., 246 tok/s on Llama2 7B (gf4) and 225 tok/s on Mistral 7B (gf4) on an RTX 4090.
  • Offers efficient inference on Apple Silicon, with up to 73 tok/s on Llama3 8B (gf4) on an M1 Max.
  • Optimized for fp8 and gf4 quantization, significantly reducing memory footprint and increasing speed.

Maintenance & Community

The project is maintained by zeux. There are no explicit community links (Discord/Slack) or roadmaps provided in the README.

Licensing & Compatibility

The README does not explicitly state a license; check the repository for a LICENSE file before relying on the code. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The project is explicitly not production-ready or stable; it prioritizes experimentation and prototyping. Prompt processing is currently serial, which can become a bottleneck for long prompts. Performance on high-end Apple Silicon chips may fall short of peak due to profiling limitations, and Gemma 7B has reported issues with fp8 quantization.

Health Check

  • Last Commit: 3 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 9 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Jeremy Howard (Cofounder of fast.ai).

GPTFast by MDK8888

0%
687
HF Transformers accelerator for faster inference
Created 1 year ago
Updated 1 year ago
Starred by Lysandre Debut (Chief Open-Source Officer at Hugging Face), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 4 more.

AQLM by Vahe1994

0.4%
1k
PyTorch code for LLM compression via Additive Quantization (AQLM)
Created 1 year ago
Updated 1 month ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Ying Sheng (Coauthor of SGLang).

fastllm by ztxz16

0.4%
4k
High-performance C++ LLM inference library
Created 2 years ago
Updated 1 week ago
Starred by Wing Lian (Founder of Axolotl AI) and Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems").

airllm by lyogavin

0.1%
6k
Inference optimization for LLMs on low-resource hardware
Created 2 years ago
Updated 2 weeks ago