Llama-2-Onnx by microsoft

ONNX-optimized Llama 2 models for research/experimentation

created 2 years ago
1,028 stars

Top 37.1% on sourcepulse

Project Summary

This repository provides optimized ONNX versions of Meta's Llama 2 models, targeting developers and researchers who need efficient inference for large language models. It offers pre-converted ONNX models in multiple Llama 2 sizes (7B, 13B) and precisions (FP16, FP32), including fine-tuned variants for dialogue applications, enabling faster deployment and experimentation.

How It Works

The project leverages ONNX Runtime for optimized inference, converting Llama 2's architecture into a graph representation that can be executed efficiently across diverse hardware. Llama 2 itself is a decoder-only transformer; its larger variants add Grouped Query Attention (GQA) for more efficient inference, and its feed-forward layers use a hidden-size projection of roughly 2.7x the embedding dimension rather than the conventional 4x used in standard transformers.

Quick Start & Requirements

  • Install: Clone the repository and initialize/update specific submodules for desired model versions (e.g., git submodule init 7B_FT_float16).
  • Prerequisites: Git LFS for handling large model files (approx. 10GB for 7B FP16). Access to ONNX files requires signing up via a provided link.
  • Examples: Includes a MinimumExample for command-line text completion and a ChatApp using Gradio for a chatbot interface.

Highlighted Details

  • Optimized ONNX versions of Llama 2 (7B, 13B) in FP16 and FP32.
  • Includes fine-tuned models for dialogue applications, requiring specific input formatting.
  • ONNX Runtime enables hardware-specific optimizations and JIT compilation for faster subsequent inferences.
  • Supports temperature and top-p sampling for controlling text generation.
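The temperature and top-p controls mentioned above can be sketched in a few lines of NumPy. This is a generic implementation of nucleus sampling under the usual definitions, not the repository's exact code; the function name and defaults are illustrative.

```python
import numpy as np

def sample_top_p(logits, temperature=0.8, top_p=0.9, rng=None):
    """Temperature + nucleus (top-p) sampling over a 1-D logits vector."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64)
    # Temperature scaling: lower values sharpen the distribution.
    scaled = logits / max(temperature, 1e-6)
    scaled -= scaled.max()                      # numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    # Keep the smallest set of top tokens whose mass reaches top_p.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, top_p) + 1
    keep = order[:cutoff]
    kept = probs[keep] / probs[keep].sum()
    return int(rng.choice(keep, p=kept))
```

With a very small `top_p` the kept set collapses to the single most likely token, which makes generation effectively greedy; larger values of `top_p` and `temperature` trade determinism for diversity.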

Maintenance & Community

Maintained by Microsoft, with links to Meta's Llama 2 model card and responsible use guidelines. No specific community channels (Discord/Slack) are listed.

Licensing & Compatibility

Uses the Llama 2 Community License Agreement; Microsoft's contributions are subject to the same license. Commercial use is permitted under its terms.

Limitations & Caveats

Access to ONNX model files is gated via a sign-up process. FP16 ONNX performance may be slower than FP32 if the target hardware lacks native FP16 support, leading to runtime casts.

Health Check
Last commit

1 year ago

Responsiveness

Inactive

Pull Requests (30d)
0
Issues (30d)
0
Star History
4 stars in the last 90 days
