zml by zml

AI inference stack for production

Project Summary

ZML is a high-performance AI inference stack designed for production environments, enabling the deployment of diverse AI models across heterogeneous hardware. It targets AI engineers and researchers seeking a unified, efficient, and flexible platform for deploying models, particularly in distributed or multi-accelerator setups.

How It Works

ZML leverages the Zig programming language for its performance and safety guarantees, combined with MLIR for flexible model compilation and optimization. The stack is built with Bazel for reproducible dependency management and cross-compilation. Its core advantage is abstracting hardware complexity: models can be compiled and run across different accelerators (NVIDIA GPUs, AMD GPUs, Google TPUs, and more) with minimal code changes, which also enables distributed inference across geographically dispersed hardware.
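
To give a feel for the programming model, here is a minimal sketch of a ZML-style model definition in Zig, loosely adapted from the flavor of the repository's MNIST example. The exact API surface (zml.Tensor, zml.call, the tensor method names) is assumed here for illustration and should be checked against the official documentation:

    const zml = @import("zml");

    // Sketch only: a two-layer MLP in the ZML style. The model is a plain
    // Zig struct whose tensor fields are populated from checkpoint weights;
    // `forward` describes the computation, which ZML compiles through MLIR
    // for whichever accelerator runtime is enabled at build time.
    const Mlp = struct {
        fc1: Layer,
        fc2: Layer,

        const Layer = struct {
            weight: zml.Tensor,
            bias: zml.Tensor,

            pub fn forward(self: Layer, input: zml.Tensor) zml.Tensor {
                // Linear layer + ReLU; the same code targets CUDA, ROCm, or TPU.
                return self.weight.matmul(input).add(self.bias).relu();
            }
        };

        pub fn forward(self: Mlp, input: zml.Tensor) zml.Tensor {
            const hidden = zml.call(self.fc1, .forward, .{input});
            return zml.call(self.fc2, .forward, .{hidden});
        }
    };

Because the model is ordinary Zig code compiled through MLIR, switching accelerators is a build-time choice rather than a code change, which is what makes the cross-hardware and distributed setups described above practical.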

Quick Start & Requirements

  • Install Bazel: recommended via bazelisk (macOS: brew install bazelisk; Linux: download the binary).
  • Run MNIST: cd examples && bazel run -c opt //mnist
  • Run Llama 3.1 8B (requires Hugging Face access approval and an access token): cd examples && bazel run -c opt //llama:Llama-3.1-8B-Instruct -- --prompt="What is the capital of France?"
  • GPU/TPU Support: append --@zml//runtimes:cuda=true, --@zml//runtimes:rocm=true, or --@zml//runtimes:tpu=true to the Bazel command (see the combined example after this list).
  • Documentation: website, Getting Started guide, and reference documentation.
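
The runtime flags compose with any of the example targets; running the Llama example on an NVIDIA GPU simply combines the commands listed above:

    cd examples
    bazel run -c opt //llama:Llama-3.1-8B-Instruct \
      --@zml//runtimes:cuda=true \
      -- --prompt="What is the capital of France?"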

Highlighted Details

  • Supports distributed inference across multiple, geographically diverse accelerators (e.g., NVIDIA, AMD, Google TPU).
  • Built with Zig, MLIR, and Bazel for performance, flexibility, and robust builds.
  • Demonstrates running sharded Llama 2 models across NVIDIA, AMD, and Google Cloud TPUs over a VPN.
  • Includes examples for MNIST and various Meta Llama models.

Maintenance & Community

  • Active development with contributions from ZML and OpenXLA.
  • Community support via Discord.
  • Contributing guidelines available.

Licensing & Compatibility

  • Licensed under Apache 2.0.
  • Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

  • Some models, like Meta's Llama series, require explicit approval and access tokens from the model provider.
  • The project is actively under development, and specific hardware acceleration features might be experimental.
Health Check

  • Last commit: 1 day ago
  • Responsiveness: 1 day
  • Pull requests (30d): 27
  • Issues (30d): 3