luminal by luminal-ai

Deep learning library using composable compilers for high performance

Created 2 years ago
2,592 stars

Top 18.1% on SourcePulse

Project Summary

Luminal is a Rust-based deep learning library designed for high-performance inference and training through a composable, ahead-of-time compilation approach. It targets developers seeking maximum efficiency on diverse hardware, from consumer CPUs and Apple Silicon to NVIDIA GPUs, by compiling computation graphs into optimized, native code.

How It Works

Luminal takes a compile-time-first approach: networks are expressed as static computation graphs built from just 11 primitive operations. Because the entire graph is known ahead of time, its compilers (e.g., CPUCompiler, MetalCompiler, CUDACompiler) can treat the whole network as a single unit and apply aggressive optimizations such as kernel fusion and shape-specific code generation. In contrast to eager execution, complexity is pushed into the compiler, enabling hardware-specific tuning without maintaining divergent hand-written code paths per backend.
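The core workflow is: define a graph, mark outputs, compile for a backend, then execute. The sketch below is adapted from the usage pattern in the project README; exact method names and signatures (tensor, set, retrieve, data, and the backend crate layout) vary between Luminal versions, so treat them as assumptions rather than a definitive API reference.

    use luminal::prelude::*;
    // Backend compilers live in their own crates (e.g. luminal_metal,
    // luminal_cuda); the CPU path is assumed here.

    fn main() {
        // 1. Build a static computation graph; nothing executes yet.
        let mut cx = Graph::new();
        let a = cx.tensor(3).set(vec![1.0, 2.0, 3.0]);
        let b = cx.tensor(3).set(vec![4.0, 5.0, 6.0]);

        // Arithmetic on tensors only records nodes in the graph;
        // retrieve() marks `c` as an output to keep through compilation.
        let mut c = (a + b).retrieve();

        // 2. Compile the whole graph as a single unit (kernel fusion,
        //    shape-specific codegen). Swap in MetalCompiler / CUDACompiler
        //    to target other hardware.
        cx.compile(CPUCompiler::default(), &mut c);

        // 3. Run the optimized kernels.
        cx.execute();
        println!("{:?}", c.data()); // expected: [5.0, 7.0, 9.0]
    }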

Quick Start & Requirements

  • Install/Run: for the Llama 3 example, cd ./examples/llama, run ./setup/setup.sh, then cargo run --release --features <cuda|metal|cpu>.
  • Prerequisites: Rust toolchain; CUDA Toolkit (for NVIDIA GPUs) or Metal (for macOS).
  • Resources: the Llama 3 8B example runs locally on M-series MacBooks at 15-25 tokens/sec.
  • Docs: https://github.com/jafioti/luminal/blob/main/README.md#getting-started

Highlighted Details

  • Achieves 15-25 tokens/sec for Q8 Llama 3 8B on M-series MacBooks.
  • Compiles directly to native CUDA and Metal code rather than relying on cross-platform abstraction layers.
  • Offers full training support with graph-based autograd (see the sketch after this list).
  • Ships examples for Llama 3, Phi 3, Whisper, and YOLO v8.
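To make "graph-based autograd" concrete: gradients are derived by walking the recorded computation graph in reverse and accumulating each node's contribution. The following is a self-contained, minimal Rust sketch of that technique over a toy scalar graph; it illustrates the general idea only and is not Luminal's training API (all names here are hypothetical).

    // Minimal reverse-mode autograd over a static expression graph.
    // Illustrative only; not Luminal's API.
    #[derive(Clone, Copy)]
    enum Op {
        Input,              // leaf value
        Add(usize, usize),  // node ids of operands
        Mul(usize, usize),
    }

    struct TapeGraph {
        ops: Vec<Op>,
        vals: Vec<f32>, // forward values, computed as nodes are added
    }

    impl TapeGraph {
        fn new() -> Self {
            TapeGraph { ops: Vec::new(), vals: Vec::new() }
        }
        fn push(&mut self, op: Op, v: f32) -> usize {
            self.ops.push(op);
            self.vals.push(v);
            self.ops.len() - 1
        }
        fn input(&mut self, v: f32) -> usize { self.push(Op::Input, v) }
        fn add(&mut self, a: usize, b: usize) -> usize {
            let v = self.vals[a] + self.vals[b];
            self.push(Op::Add(a, b), v)
        }
        fn mul(&mut self, a: usize, b: usize) -> usize {
            let v = self.vals[a] * self.vals[b];
            self.push(Op::Mul(a, b), v)
        }
        // Walk the graph backwards from `out`, accumulating d(out)/d(node).
        fn backward(&self, out: usize) -> Vec<f32> {
            let mut grads = vec![0.0; self.ops.len()];
            grads[out] = 1.0;
            for id in (0..=out).rev() {
                let g = grads[id];
                match self.ops[id] {
                    Op::Input => {}
                    Op::Add(a, b) => {
                        grads[a] += g;
                        grads[b] += g;
                    }
                    Op::Mul(a, b) => {
                        grads[a] += g * self.vals[b];
                        grads[b] += g * self.vals[a];
                    }
                }
            }
            grads
        }
    }

    fn main() {
        // loss = x * w + b, with x = 2, w = 3, b = 1
        let mut g = TapeGraph::new();
        let (x, w, b) = (g.input(2.0), g.input(3.0), g.input(1.0));
        let xw = g.mul(x, w);
        let loss = g.add(xw, b);
        let grads = g.backward(loss);
        // d(loss)/dw = x = 2, d(loss)/db = 1
        println!("dL/dw = {}, dL/db = {}", grads[w], grads[b]);
    }

Since gradients are themselves just more graph operations, the same compiler machinery used for the forward pass can, in principle, fuse and optimize the backward pass too.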

Maintenance & Community

  • Active development with a focus on compiler advancements and performance targets.
  • Roadmap includes optimizing CUDA/Metal kernels, distributed training, and matching PyTorch 2.0 performance.

Licensing & Compatibility

  • Licensed under Apache License 2.0 or MIT license, permitting commercial use and closed-source linking.

Limitations & Caveats

  • Still under active development; matching PyTorch's API coverage and performance benchmarks is a stated goal, not yet achieved.
  • Some optimizations and features, like distributed training, are on the roadmap rather than fully implemented.

Health Check

  • Last Commit: 3 days ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 10
  • Issues (30d): 2
Star History

59 stars in the last 30 days

Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), and 16 more.

Explore Similar Projects

ThunderKittens by HazyResearch

0.6% · 3k stars
CUDA kernel framework for fast deep learning primitives
Created 1 year ago · Updated 1 week ago
Starred by Luis Capelo (Cofounder of Lightning AI), Alex Yu (Research Scientist at OpenAI; Cofounder of Luma AI), and 7 more.

TransformerEngine by NVIDIA

0.7% · 3k stars
Library for Transformer model acceleration on NVIDIA GPUs
Created 3 years ago · Updated 13 hours ago
Starred by Nathan Lambert (Research Scientist at AI2), Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral), and 7 more.

DeepGEMM by deepseek-ai

0.3% · 6k stars
CUDA library for efficient FP8 GEMM kernels with fine-grained scaling
Created 8 months ago · Updated 2 weeks ago
Starred by François Chollet (Author of Keras; Cofounder of Ndea, ARC Prize), Chaoyu Yang (Founder of Bento), and 13 more.

neon by NervanaSystems

0% · 4k stars
Deep learning framework (discontinued)
Created 11 years ago · Updated 4 years ago