uzu by trymirai

High-performance inference engine for Apple Silicon

created 1 month ago
1,110 stars

Top 35.1% on sourcepulse

View on GitHub
Project Summary

Uzu is a high-performance inference engine for running AI models on Apple Silicon, offering a simple API and a hybrid architecture for efficient computation. It targets developers and researchers working with AI models on macOS and iOS, enabling faster inference through optimized use of Metal and MPSGraph.

How It Works

Uzu employs a hybrid architecture in which each layer can be computed either as a custom GPU kernel or via MPSGraph, a low-level Apple API that can dispatch work to the Neural Engine (ANE). Combined with the unified memory on Apple devices, this approach aims to maximize performance. Uzu also features traceable computations for verifying correctness against source implementations, and a unified model configuration system that simplifies adding new models.
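
As a rough illustration of this hybrid split, the sketch below tags each layer with the backend that should execute it. This is a conceptual example only: the names LayerBackend, Layer, and run_layer are invented for clarity and are not uzu's actual internals.

    // Conceptual sketch of a hybrid executor (illustrative names, not uzu's API):
    // each layer carries the backend that should run it.

    /// Backend chosen for a single layer.
    enum LayerBackend {
        /// Hand-written Metal compute kernel dispatched directly on the GPU.
        MetalKernel { pipeline: &'static str },
        /// MPSGraph subgraph, which the system may schedule on the GPU or ANE.
        MpsGraph { graph_id: usize },
    }

    struct Layer {
        name: &'static str,
        backend: LayerBackend,
    }

    /// Run one layer on its assigned backend. A real engine would encode Metal
    /// commands or execute an MPSGraph here; this stub just passes data through.
    fn run_layer(layer: &Layer, input: Vec<f32>) -> Vec<f32> {
        match &layer.backend {
            LayerBackend::MetalKernel { pipeline } => {
                println!("{}: dispatching Metal kernel `{}`", layer.name, pipeline);
                input
            }
            LayerBackend::MpsGraph { graph_id } => {
                println!("{}: executing MPSGraph subgraph #{}", layer.name, graph_id);
                input
            }
        }
    }

    fn main() {
        let layers = [
            Layer { name: "attention", backend: LayerBackend::MetalKernel { pipeline: "sdpa" } },
            Layer { name: "mlp", backend: LayerBackend::MpsGraph { graph_id: 0 } },
        ];
        let mut activations = vec![0.0_f32; 8];
        for layer in &layers {
            activations = run_layer(layer, activations);
        }
        println!("done, {} activations", activations.len());
    }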

Quick Start & Requirements

  • Install: Add uzu as a dependency in your Cargo.toml or use the prebuilt Swift framework (see the usage sketch after this list).
  • Prerequisites: Apple Silicon Mac, Rust toolchain.
  • Model Conversion: Use the lalamo tool (included) to convert models to Uzu's format (e.g., uv run lalamo convert meta-llama/Llama-3.2-1B-Instruct --precision float16).
  • Documentation: https://github.com/trymirai/uzu
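
For orientation, here is a minimal end-to-end sketch in Rust. The Engine type and its load/generate methods are placeholders invented for this example, not uzu's documented API; consult the repository before relying on any of these names.

    // Hypothetical quick-start flow (illustrative names only, not uzu's real API).
    // Step 1: convert a model with lalamo, e.g.
    //   uv run lalamo convert meta-llama/Llama-3.2-1B-Instruct --precision float16
    // Step 2: point the engine at the converted model directory and generate text.

    use std::path::PathBuf;

    /// Placeholder for an inference-engine handle over a converted model.
    struct Engine {
        model_dir: PathBuf,
    }

    impl Engine {
        /// Load a converted model from disk (stub: only records the path).
        fn load(model_dir: impl Into<PathBuf>) -> Self {
            Engine { model_dir: model_dir.into() }
        }

        /// Generate up to `max_tokens` tokens for `prompt` (stub: echoes metadata).
        fn generate(&self, prompt: &str, max_tokens: usize) -> String {
            format!(
                "[up to {max_tokens} tokens from {:?} for prompt: {prompt}]",
                self.model_dir
            )
        }
    }

    fn main() {
        // Path is illustrative; use wherever your converted model was written.
        let engine = Engine::load("models/Llama-3.2-1B-Instruct");
        let reply = engine.generate("Explain unified memory in one sentence.", 64);
        println!("{reply}");
    }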

Highlighted Details

  • Achieves competitive inference speeds, often outperforming llama.cpp on Apple Silicon for various models (e.g., 35.17 tokens/s for Llama-3.2-1B-Instruct on M2).
  • Supports model conversion and provides a CLI for running and serving models.
  • Offers a Swift framework for easier integration into macOS/iOS applications.
  • Utilizes bf16/f16 precision for performance benchmarks.

Maintenance & Community

No specific community links or contributor details are provided in the README.

Licensing & Compatibility

  • License: MIT License.
  • Compatibility: Permissive MIT license allows for commercial use and integration into closed-source projects.

Limitations & Caveats

Uzu uses its own model format, requiring a conversion step for existing models. Performance comparisons are based on specific configurations and may vary. The README notes that comparing quantized models directly can be complex due to differing quantization methods.

Health Check

  • Last commit: 4 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 16
  • Issues (30d): 3

Star History

1,155 stars in the last 90 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla and OpenAI; author of CS 231n), Nat Friedman (former CEO of GitHub), and 32 more.

llama.cpp by ggml-org

  • C/C++ library for local LLM inference
  • Top 0.4% on sourcepulse, 84k stars
  • created 2 years ago, updated 14 hours ago