zml by zml

AI inference stack for production

Created 1 year ago
2,562 stars

Top 18.2% on SourcePulse

View on GitHub
Project Summary

ZML is a high-performance AI inference stack designed for production environments, enabling the deployment of diverse AI models across heterogeneous hardware. It targets AI engineers and researchers seeking a unified, efficient, and flexible platform for deploying models, particularly in distributed or multi-accelerator setups.

How It Works

ZML leverages the Zig programming language for its performance and safety guarantees, combined with MLIR for flexible model compilation and optimization. The stack is built with Bazel for robust dependency management and cross-compilation. Its core advantage is abstracting hardware complexity: models can be compiled and run across different accelerators (NVIDIA, AMD, TPUs, etc.) with minimal code changes, which in turn enables distributed inference across geographically dispersed hardware.
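As an illustration, here is a minimal sketch of what a model layer looks like in this style. The struct layout and the Tensor method names (matmul, add, relu) follow ZML's published MNIST example, but treat the exact signatures as assumptions rather than a definitive API reference:

    const zml = @import("zml");

    // A single dense layer. The weights are plain zml.Tensor fields;
    // nothing in this code names a specific accelerator.
    const Layer = struct {
        weight: zml.Tensor,
        bias: zml.Tensor,

        // forward() is traced and lowered through MLIR, so the same
        // code can target CUDA, ROCm, or TPU backends.
        pub fn forward(self: Layer, input: zml.Tensor) zml.Tensor {
            return self.weight.matmul(input).add(self.bias).relu();
        }
    };

The actual runtime target is then selected at build time via the Bazel flags shown in the Quick Start below.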

Quick Start & Requirements

  • Install Bazel: Recommended via bazelisk (macOS: brew install bazelisk, Linux: download binary).
  • Run MNIST: cd examples && bazel run -c opt //mnist
  • Run Llama 3.1 8B: Requires Hugging Face approval and an access token. cd examples && bazel run -c opt //llama:Llama-3.1-8B-Instruct -- --prompt="What is the capital of France?"
  • GPU/TPU Support: Append --@zml//runtimes:cuda=true, --@zml//runtimes:rocm=true, or --@zml//runtimes:tpu=true to Bazel commands (a combined example follows this list).
  • Documentation: available via the project website, the Getting Started guide, and the reference docs.
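For example, combining the Llama command above with the CUDA runtime flag (Bazel flags go before the -- separator that introduces program arguments):

    cd examples
    bazel run -c opt //llama:Llama-3.1-8B-Instruct \
      --@zml//runtimes:cuda=true \
      -- --prompt="What is the capital of France?"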

Highlighted Details

  • Supports distributed inference across multiple, geographically diverse accelerators (e.g., NVIDIA, AMD, Google TPU).
  • Built with Zig, MLIR, and Bazel for performance, flexibility, and robust builds.
  • Demonstrates running sharded Llama 2 models across NVIDIA, AMD, and Google Cloud TPUs over a VPN.
  • Includes examples for MNIST and various Meta Llama models.

Maintenance & Community

  • Active development with contributions from ZML and OpenXLA.
  • Community support via Discord.
  • Contributing guidelines available.

Licensing & Compatibility

  • Licensed under Apache 2.0.
  • Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

  • Some models, like Meta's Llama series, require explicit approval and access tokens from the model provider.
  • The project is actively under development, and specific hardware acceleration features might be experimental.
Health Check

  • Last commit: 14 hours ago
  • Responsiveness: 1 day
  • Pull requests (30d): 18
  • Issues (30d): 1
  • Star history: 68 stars in the last 30 days

Explore Similar Projects

LitServe by Lightning-AI

  • AI inference pipeline framework
  • Top 0.3% on SourcePulse; 4k stars
  • Created 1 year ago; updated 2 days ago
  • Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Luis Capelo (cofounder of Lightning AI), and 3 more.

llama.cpp by ggml-org

  • C/C++ library for local LLM inference
  • Top 0.4% on SourcePulse; 87k stars
  • Created 2 years ago; updated 14 hours ago
  • Starred by Andrej Karpathy (founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Nat Friedman (former CEO of GitHub), and 54 more.