onnxruntime-genai by microsoft

GenAI extension for running LLMs with ONNX Runtime

created 1 year ago
771 stars

Top 46.2% on sourcepulse

Project Summary

This project provides Generative AI extensions for ONNX Runtime, enabling efficient on-device execution of Large Language Models (LLMs). It targets developers and researchers seeking a flexible and performant solution for LLM inference, offering a complete generative AI loop including pre/post-processing, inference, and sampling.

How It Works

The library implements the full generative AI loop for ONNX models. It handles tokenization, inference via ONNX Runtime, logits processing, search and sampling strategies, and KV cache management. This integrated approach simplifies the deployment of LLMs by abstracting complex pipeline components, allowing users to focus on model integration and application logic.
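
A minimal sketch of that loop using the Python bindings (method names follow the published docs but have shifted between releases; the model path is a placeholder):

    import onnxruntime_genai as og

    # Load an exported ONNX model folder and its matching tokenizer.
    model = og.Model("path/to/onnx-model")   # placeholder path
    tokenizer = og.Tokenizer(model)

    # Configure search/sampling options; KV cache management is internal.
    params = og.GeneratorParams(model)
    params.set_search_options(max_length=256, do_sample=True, temperature=0.7)

    # Token-by-token decode loop: logits processing, sampling, and cache
    # updates all happen inside generate_next_token().
    generator = og.Generator(model, params)
    generator.append_tokens(tokenizer.encode("Why is the sky blue?"))
    stream = tokenizer.create_stream()
    while not generator.is_done():
        generator.generate_next_token()
        print(stream.decode(generator.get_next_tokens()[0]), end="", flush=True)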

Quick Start & Requirements

  • Install: pip install numpy, then pip install --pre onnxruntime-genai
  • Prerequisites: Python and NumPy. Models must be in ONNX format and can be downloaded from sources such as Hugging Face. Hardware acceleration (CUDA, DirectML) is supported but not required.
  • Documentation: https://onnxruntime.ai/docs/genai
  • Sample Code: Provided for Phi-3; see the sketch after this list.
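
For example, with a Phi-3 ONNX model downloaded from Hugging Face, the loop from the previous section would be pointed at the local folder (the repository id and subfolder below are illustrative; check the model card for the actual layout):

    import onnxruntime_genai as og

    # Illustrative layout of microsoft/Phi-3-mini-4k-instruct-onnx (CPU int4 variant).
    model = og.Model("Phi-3-mini-4k-instruct-onnx/cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4")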

Highlighted Details

  • Supports multiple model architectures including Gemma, Llama, Mistral, Phi, Qwen, and more.
  • Offers Python, C#, C/C++, and Java APIs, with cross-platform support for Linux, Windows, macOS, and Android.
  • Features hardware acceleration via CUDA and DirectML, with QNN, OpenVINO, and ROCm under development.
  • Includes advanced features like MultiLoRA (sketched below), continuous decoding, and constrained decoding.
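
As an illustration of MultiLoRA, the Python API exposes an adapter container whose adapters can be activated per generator. A hedged sketch; the model path and adapter file name are placeholders:

    import onnxruntime_genai as og

    model = og.Model("path/to/base-model")                   # placeholder base model
    adapters = og.Adapters(model)
    adapters.load("summarizer.onnx_adapter", "summarizer")   # placeholder adapter file/name

    params = og.GeneratorParams(model)
    generator = og.Generator(model, params)
    generator.set_active_adapter(adapters, "summarizer")     # route this request through the LoRA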

Maintenance & Community

The project is actively maintained by Microsoft. Discussions for feature requests and community engagement are available via GitHub Discussions. Contributions are welcome, subject to a Contributor License Agreement (CLA).

Licensing & Compatibility

The project is released under the permissive MIT License, allowing for commercial use and integration with closed-source applications.

Limitations & Caveats

A breaking API change landed between release candidate 0.7.0-rc2 and the final 0.7.0 release: the tokenizer.encode method now returns NumPy arrays instead of Python lists. Support for certain platforms (iOS) and hardware backends (ROCm) is still under development or requires building from source. Examples on the main branch may not always match the latest stable release.
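
Code written against the release candidates that assumed a list return may need an explicit conversion. A minimal sketch, reusing the placeholder model path from above:

    import onnxruntime_genai as og

    model = og.Model("path/to/onnx-model")  # placeholder path
    tokenizer = og.Tokenizer(model)

    tokens = tokenizer.encode("hello")      # 0.7.0 and later: numpy.ndarray, not list
    token_list = tokens.tolist()            # restore list semantics if needed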

Health Check

  • Last commit: 1 day ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 71
  • Issues (30d): 20
  • Star History: 74 stars in the last 90 days

Explore Similar Projects

Starred by Patrick von Platen (Core Contributor to Hugging Face Transformers and Diffusers), Michael Han (Cofounder of Unsloth), and 1 more.

ktransformers by kvcache-ai
0.4% · 15k stars
Framework for LLM inference optimization experimentation
created 1 year ago · updated 2 days ago
Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Nat Friedman (Former CEO of GitHub), and 32 more.

llama.cpp by ggml-org
0.4% · 84k stars
C/C++ library for local LLM inference
created 2 years ago · updated 16 hours ago