zml by zml

AI inference stack for production

Project Summary

ZML is a high-performance AI inference stack designed for production environments, enabling the deployment of diverse AI models across heterogeneous hardware. It targets AI engineers and researchers seeking a unified, efficient, and flexible platform for deploying models, particularly in distributed or multi-accelerator setups.

How It Works

ZML leverages the Zig programming language for its performance and safety guarantees, combined with MLIR for flexible model compilation and optimization. The stack is built with Bazel for reproducible dependency management and cross-compilation. Its core advantage is abstracting hardware complexity: models can be compiled and run across different accelerators (NVIDIA GPUs, AMD GPUs, Google TPUs, and more) with minimal code changes, which also enables distributed inference across geographically dispersed hardware.
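
To give a feel for the programming model, here is a minimal sketch of a ZML-style model definition in Zig, loosely adapted from the flavor of the repository's MNIST example. The exact API surface (zml.Tensor, zml.call, the tensor method names) is assumed here for illustration and should be checked against the official documentation:

    const zml = @import("zml");

    // Sketch only: a two-layer MLP in the ZML style. The model is a plain
    // Zig struct whose tensor fields are populated from checkpoint weights;
    // `forward` describes the computation, which ZML compiles through MLIR
    // for whichever accelerator runtime is enabled at build time.
    const Mlp = struct {
        fc1: Layer,
        fc2: Layer,

        const Layer = struct {
            weight: zml.Tensor,
            bias: zml.Tensor,

            pub fn forward(self: Layer, input: zml.Tensor) zml.Tensor {
                // Linear layer + ReLU; the same code targets CUDA, ROCm, or TPU.
                return self.weight.matmul(input).add(self.bias).relu();
            }
        };

        pub fn forward(self: Mlp, input: zml.Tensor) zml.Tensor {
            const hidden = zml.call(self.fc1, .forward, .{input});
            return zml.call(self.fc2, .forward, .{hidden});
        }
    };

Because the model is ordinary Zig code compiled through MLIR, switching accelerators is a build-time choice rather than a code change, which is what makes the cross-hardware and distributed setups described above practical.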

Quick Start & Requirements

  • Install Bazel: recommended via bazelisk (macOS: brew install bazelisk; Linux: download the binary).
  • Run MNIST: cd examples && bazel run -c opt //mnist
  • Run Llama 3.1 8B (requires Hugging Face access approval and an access token): cd examples && bazel run -c opt //llama:Llama-3.1-8B-Instruct -- --prompt="What is the capital of France?"
  • GPU/TPU Support: append --@zml//runtimes:cuda=true, --@zml//runtimes:rocm=true, or --@zml//runtimes:tpu=true to the Bazel command (see the combined example after this list).
  • Documentation: website, Getting Started guide, and reference documentation.
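
The runtime flags compose with any of the example targets; running the Llama example on an NVIDIA GPU simply combines the commands listed above:

    cd examples
    bazel run -c opt //llama:Llama-3.1-8B-Instruct \
      --@zml//runtimes:cuda=true \
      -- --prompt="What is the capital of France?"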

Highlighted Details

  • Supports distributed inference across multiple, geographically diverse accelerators (e.g., NVIDIA, AMD, Google TPU).
  • Built with Zig, MLIR, and Bazel for performance, flexibility, and robust builds.
  • Demonstrates running sharded Llama 2 models across NVIDIA, AMD, and Google Cloud TPUs over a VPN.
  • Includes examples for MNIST and various Meta Llama models.

Maintenance & Community

  • Active development with contributions from ZML and OpenXLA.
  • Community support via Discord.
  • Contributing guidelines available.

Licensing & Compatibility

  • Licensed under Apache 2.0.
  • Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

  • Some models, like Meta's Llama series, require explicit approval and access tokens from the model provider.
  • The project is actively under development, and specific hardware acceleration features might be experimental.
Health Check

  • Last commit: 1 day ago
  • Responsiveness: 1 day
  • Pull requests (30d): 27
  • Issues (30d): 3