onnxruntime-genai by microsoft

GenAI extension for running LLMs with ONNX Runtime

Created 1 year ago
829 stars

Top 42.8% on SourcePulse

Project Summary

This project provides Generative AI extensions for ONNX Runtime, enabling efficient on-device execution of Large Language Models (LLMs). It targets developers and researchers seeking a flexible and performant solution for LLM inference, offering a complete generative AI loop including pre/post-processing, inference, and sampling.

How It Works

The library implements the full generative AI loop for ONNX models. It handles tokenization, inference via ONNX Runtime, logits processing, search and sampling strategies, and KV cache management. This integrated approach simplifies the deployment of LLMs by abstracting complex pipeline components, allowing users to focus on model integration and application logic.
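The stages described above can be sketched with toy stand-ins. This is purely illustrative of the loop structure (prefill the prompt, then decode token by token while updating a cache); every component here is a hypothetical placeholder, not the real ONNX Runtime or onnxruntime-genai API.

```python
# Toy sketch of a generative AI loop: tokenize -> model step -> logits ->
# greedy sampling -> KV cache update. All components are stand-ins.
VOCAB = {"<eos>": 0, "hello": 1, "world": 2, "onnx": 3}
INV = {v: k for k, v in VOCAB.items()}

def tokenize(text):
    return [VOCAB[w] for w in text.split()]

def model_step(token, kv_cache):
    kv_cache.append(token)  # stand-in for attention key/value state
    # Toy "logits": put all probability mass on the next vocab id, wrapping.
    return [1.0 if i == (token + 1) % len(VOCAB) else 0.0
            for i in range(len(VOCAB))]

def sample_greedy(logits):
    return max(range(len(logits)), key=lambda i: logits[i])

def generate(prompt, max_new_tokens=8):
    tokens, kv_cache = tokenize(prompt), []
    # Prefill: run the prompt through the model to populate the cache.
    for t in tokens:
        logits = model_step(t, kv_cache)
    # Decode: sample one token at a time until <eos> or the length limit.
    out = []
    for _ in range(max_new_tokens):
        nxt = sample_greedy(logits)
        if nxt == VOCAB["<eos>"]:
            break
        out.append(nxt)
        logits = model_step(nxt, kv_cache)
    return " ".join(INV[t] for t in out)

print(generate("hello world"))  # -> onnx
```

The real library performs the same orchestration, with the model step delegated to ONNX Runtime and the sampling stage configurable (greedy, top-p, beam search, and so on).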

Quick Start & Requirements

  • Install: pip install numpy and pip install --pre onnxruntime-genai
  • Prerequisites: Python, NumPy. Specific models may require downloading ONNX-formatted models (e.g., from Hugging Face). Hardware acceleration (CUDA, DirectML) is supported but not strictly required.
  • Documentation: https://onnxruntime.ai/docs/genai
  • Sample Code: Provided for Phi-3.
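A minimal usage sketch, based on the documented Python API; exact method names can vary across versions, and the model directory path and prompt template below are assumptions (a downloaded ONNX-format Phi-3 model is required):

```python
def generate(model_dir: str, prompt: str, max_length: int = 256) -> str:
    """Run one prompt through an ONNX GenAI model directory (e.g. Phi-3)."""
    # Lazy import so the function can be defined without the package installed;
    # requires: pip install --pre onnxruntime-genai
    import onnxruntime_genai as og

    model = og.Model(model_dir)
    tokenizer = og.Tokenizer(model)

    params = og.GeneratorParams(model)
    params.set_search_options(max_length=max_length)

    generator = og.Generator(model, params)
    generator.append_tokens(tokenizer.encode(prompt))

    # Token-by-token decoding loop; stops at EOS or max_length.
    while not generator.is_done():
        generator.generate_next_token()

    return tokenizer.decode(generator.get_sequence(0))
```

See the official documentation linked above for the authoritative API and the Phi-3 sample.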

Highlighted Details

  • Supports multiple model architectures including Gemma, Llama, Mistral, Phi, Qwen, and more.
  • Offers Python, C#, C/C++, and Java APIs, with cross-platform support for Linux, Windows, macOS, and Android.
  • Features hardware acceleration via CUDA and DirectML, with QNN, OpenVINO, and ROCm under development.
  • Includes advanced features like MultiLoRA, continuous decoding, and constrained decoding.

Maintenance & Community

The project is actively maintained by Microsoft. Discussions for feature requests and community engagement are available via GitHub Discussions. Contributions are welcome, subject to a Contributor License Agreement (CLA).

Licensing & Compatibility

The project is released under the permissive MIT license, allowing for commercial use and integration with closed-source applications.

Limitations & Caveats

A breaking API change occurred between release candidate 0.7.0-rc2 and the final 0.7.0 release: the tokenizer.encode method now returns NumPy arrays instead of Python lists. Support for certain platforms (iOS) and hardware acceleration (ROCm) is under development or requires building from source. Examples in the main branch may not always align with the latest stable release.

Health Check

  • Last Commit: 12 hours ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 71
  • Issues (30d): 17
  • Star History: 34 stars in the last 30 days
