luminal by luminal-ai

Deep learning library using composable compilers for high performance

Created 2 years ago
2,676 stars

Top 17.5% on SourcePulse

Project Summary

Luminal is a Rust-based deep learning library designed for high-performance inference and training through a composable, ahead-of-time compilation approach. It targets developers seeking maximum efficiency on diverse hardware, from consumer CPUs and Apple Silicon to NVIDIA GPUs, by compiling computation graphs into optimized, native code.

How It Works

Luminal takes a compile-time-first approach: models are expressed as static computation graphs built from 11 primitive operations. Because the entire network is visible as a single unit, its compilers (e.g., CPUCompiler, MetalCompiler, CUDACompiler) can perform aggressive optimizations such as kernel fusion and shape-specific code generation. In contrast to eager execution, this pushes complexity into the compiler, enabling hardware-specific tuning without maintaining divergent code for each backend.
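
A minimal sketch of that workflow, adapted from the README's getting-started example (exact tensor-creation and compiler APIs are assumptions here and may differ between luminal versions):

    use luminal::prelude::*;

    fn main() {
        // Build a static computation graph; nothing executes yet.
        let mut cx = Graph::new();
        let a = cx.tensor((3, 1)).set([[1.0], [2.0], [3.0]]);
        let b = cx.tensor((1, 4)).set([[1.0, 2.0, 3.0, 4.0]]);

        // Record a matmul and mark its output to be kept after execution.
        let mut c = a.matmul(b).retrieve();

        // Compile the whole graph as one unit, then run it.
        cx.compile(GenericCompiler::default(), &mut c);
        cx.execute();

        // The output tensor now holds real data.
        println!("Result: {:?}", c);
    }

Swapping GenericCompiler for a backend compiler such as MetalCompiler or CUDACompiler retargets the same graph to GPU hardware without touching the model code.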

Quick Start & Requirements

  • Install/Run: for the Llama 3 example, cd ./examples/llama, run ./setup/setup.sh, then cargo run --release --features <cuda|metal|cpu> (backend selection is sketched after this list).
  • Prerequisites: Rust toolchain, CUDA Toolkit (for NVIDIA), Metal (for macOS).
  • Resources: the Llama 3 8B example runs locally on M-series MacBooks at 15-25 tokens/sec.
  • Docs: https://github.com/jafioti/luminal/blob/main/README.md#getting-started
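
The <cuda|metal|cpu> feature flag determines which backend compiler is applied to the graph. A hedged sketch of that selection (the crate paths luminal_cpu, luminal_metal, and luminal_cuda and the half-precision type parameter are assumptions; exact names may differ by version):

    use luminal::prelude::*;

    // Compile the graph for whichever backend the build enables.
    // Compiler names follow the summary above; crate paths are assumed.
    fn compile_for_backend(cx: &mut Graph, out: &mut GraphTensor) {
        #[cfg(feature = "cpu")]
        cx.compile(luminal_cpu::CPUCompiler::default(), out);
        #[cfg(feature = "metal")]
        cx.compile(luminal_metal::MetalCompiler::<half::f16>::default(), out);
        #[cfg(feature = "cuda")]
        cx.compile(luminal_cuda::CUDACompiler::<half::f16>::default(), out);
    }

Because each backend is just another compiler pass over the same 11-op graph, adding a target does not fork the model definition.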

Highlighted Details

  • Achieves 15-25 tokens/sec for Q8 Llama 3 8B on M-series MacBooks.
  • Compiles natively to CUDA and Metal rather than through intermediate abstraction layers.
  • Offers full training support with graph-based autograd.
  • Implements examples for Llama 3, Phi 3, Whisper, and YOLO v8.

Maintenance & Community

  • Active development with a focus on compiler advancements and performance targets.
  • Roadmap includes optimizing CUDA/Metal kernels, distributed training, and matching PyTorch 2.0 performance.

Licensing & Compatibility

  • Dual-licensed under Apache License 2.0 or MIT, permitting commercial use and closed-source linking.

Limitations & Caveats

  • Still under active development with stated goals to match PyTorch API coverage and performance benchmarks.
  • Some optimizations and features, like distributed training, are on the roadmap rather than fully implemented.
Health Check

  • Last Commit: 15 hours ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 19
  • Issues (30d): 6

Star History

  • 42 stars in the last 30 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Vincent Weisser (Cofounder of Prime Intellect), and 17 more.

ThunderKittens by HazyResearch

Top 0.5% on SourcePulse · 3k stars
CUDA kernel framework for fast deep learning primitives
Created 1 year ago · Updated 16 hours ago
Starred by Luis Capelo (Cofounder of Lightning AI), Alex Yu (Research Scientist at OpenAI; Cofounder of Luma AI), and 7 more.

TransformerEngine by NVIDIA

Top 0.9% on SourcePulse · 3k stars
Library for Transformer model acceleration on NVIDIA GPUs
Created 3 years ago · Updated 1 day ago
Starred by Chip Huyen (Author of "AI Engineering" and "Designing Machine Learning Systems"), Eric Zhang (Founding Engineer at Modal), and 9 more.

DeepGEMM by deepseek-ai

Top 0.4% on SourcePulse · 6k stars
CUDA library for efficient FP8 GEMM kernels with fine-grained scaling
Created 11 months ago · Updated 5 days ago
Starred by François Chollet (Author of Keras; Cofounder of Ndea, ARC Prize), Chaoyu Yang (Founder of Bento), and 13 more.

neon by NervanaSystems

Top 0% on SourcePulse · 4k stars
Deep learning framework (discontinued)
Created 11 years ago · Updated 5 years ago