onnxruntime-genai by microsoft

GenAI extension for running LLMs with ONNX Runtime

created 1 year ago
771 stars

Top 46.2% on sourcepulse

Project Summary

This project provides Generative AI extensions for ONNX Runtime, enabling efficient on-device execution of Large Language Models (LLMs). It targets developers and researchers seeking a flexible and performant solution for LLM inference, offering a complete generative AI loop including pre/post-processing, inference, and sampling.

How It Works

The library implements the full generative AI loop for ONNX models. It handles tokenization, inference via ONNX Runtime, logits processing, search and sampling strategies, and KV cache management. This integrated approach simplifies the deployment of LLMs by abstracting complex pipeline components, allowing users to focus on model integration and application logic.
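
A minimal sketch of that loop using the Python bindings (method names follow the published docs but have shifted between releases; the model path is a placeholder):

    import onnxruntime_genai as og

    # Load an exported ONNX model folder and its matching tokenizer.
    model = og.Model("path/to/onnx-model")   # placeholder path
    tokenizer = og.Tokenizer(model)

    # Configure search/sampling options; KV cache management is internal.
    params = og.GeneratorParams(model)
    params.set_search_options(max_length=256, do_sample=True, temperature=0.7)

    # Token-by-token decode loop: logits processing, sampling, and cache
    # updates all happen inside generate_next_token().
    generator = og.Generator(model, params)
    generator.append_tokens(tokenizer.encode("Why is the sky blue?"))
    stream = tokenizer.create_stream()
    while not generator.is_done():
        generator.generate_next_token()
        print(stream.decode(generator.get_next_tokens()[0]), end="", flush=True)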

Quick Start & Requirements

  • Install: pip install numpy, then pip install --pre onnxruntime-genai
  • Prerequisites: Python and NumPy. Models must be in ONNX format and can be downloaded from sources such as Hugging Face. Hardware acceleration (CUDA, DirectML) is supported but not required.
  • Documentation: https://onnxruntime.ai/docs/genai
  • Sample Code: Provided for Phi-3; see the sketch after this list.
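
For example, with a Phi-3 ONNX model downloaded from Hugging Face, the loop from the previous section would be pointed at the local folder (the repository id and subfolder below are illustrative; check the model card for the actual layout):

    import onnxruntime_genai as og

    # Illustrative layout of microsoft/Phi-3-mini-4k-instruct-onnx (CPU int4 variant).
    model = og.Model("Phi-3-mini-4k-instruct-onnx/cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4")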

Highlighted Details

  • Supports multiple model architectures including Gemma, Llama, Mistral, Phi, Qwen, and more.
  • Offers Python, C#, C/C++, and Java APIs, with cross-platform support for Linux, Windows, macOS, and Android.
  • Features hardware acceleration via CUDA and DirectML, with QNN, OpenVINO, and ROCm under development.
  • Includes advanced features like MultiLoRA (sketched below), continuous decoding, and constrained decoding.
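
As an illustration of MultiLoRA, the Python API exposes an adapter container whose adapters can be activated per generator. A hedged sketch; the model path and adapter file name are placeholders:

    import onnxruntime_genai as og

    model = og.Model("path/to/base-model")                   # placeholder base model
    adapters = og.Adapters(model)
    adapters.load("summarizer.onnx_adapter", "summarizer")   # placeholder adapter file/name

    params = og.GeneratorParams(model)
    generator = og.Generator(model, params)
    generator.set_active_adapter(adapters, "summarizer")     # route this request through the LoRA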

Maintenance & Community

The project is actively maintained by Microsoft. Discussions for feature requests and community engagement are available via GitHub Discussions. Contributions are welcome, subject to a Contributor License Agreement (CLA).

Licensing & Compatibility

The project is released under the permissive MIT License, allowing for commercial use and integration with closed-source applications.

Limitations & Caveats

A breaking API change landed between release candidate 0.7.0-rc2 and the final 0.7.0 release: the tokenizer.encode method now returns NumPy arrays instead of Python lists. Support for certain platforms (iOS) and hardware backends (ROCm) is still under development or requires building from source. Examples on the main branch may not always match the latest stable release.
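
Code written against the release candidates that assumed a list return may need an explicit conversion. A minimal sketch, reusing the placeholder model path from above:

    import onnxruntime_genai as og

    model = og.Model("path/to/onnx-model")  # placeholder path
    tokenizer = og.Tokenizer(model)

    tokens = tokenizer.encode("hello")      # 0.7.0 and later: numpy.ndarray, not list
    token_list = tokens.tolist()            # restore list semantics if needed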

Health Check

  • Last commit: 1 day ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 71
  • Issues (30d): 20
  • Star History: 74 stars in the last 90 days

Explore Similar Projects

Starred by Patrick von Platen (Core Contributor to Hugging Face Transformers and Diffusers), Michael Han (Cofounder of Unsloth), and 1 more.

ktransformers by kvcache-ai
0.4% · 15k stars
Framework for LLM inference optimization experimentation
created 1 year ago · updated 2 days ago
Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Nat Friedman (Former CEO of GitHub), and 32 more.

llama.cpp by ggml-org
0.4% · 84k stars
C/C++ library for local LLM inference
created 2 years ago · updated 16 hours ago