ONNX-optimized Llama 2 models for research/experimentation
This repository provides optimized ONNX versions of Meta's Llama 2 models, targeting developers and researchers who need efficient inference for large language models. It offers pre-converted ONNX graphs for multiple Llama 2 sizes (7B, 13B) and precisions (FP16, FP32), including fine-tuned variants for dialogue applications, enabling faster deployment and experimentation.
How It Works
The project leverages ONNX Runtime for optimized inference, converting Llama 2's architecture into a graph representation that can be executed efficiently across diverse hardware. Llama 2 itself is a decoder-only transformer; its larger variants use Grouped Query Attention (GQA) for improved efficiency, and its feed-forward layers expand the hidden dimension to roughly 2.7x the model width rather than the conventional 4x of standard transformers.
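As a rough sketch of what that graph execution looks like in practice (the model path here is a hypothetical placeholder, not the repository's documented layout), ONNX Runtime loads the exported graph once and then runs it with whichever execution provider the hardware supports:

```python
# Minimal sketch: load an exported Llama 2 ONNX graph with ONNX Runtime.
# The model path is a hypothetical placeholder for a downloaded submodule file.
import onnxruntime as ort

# ONNX Runtime falls back through this list, so the same code runs on GPU or CPU.
providers = ["CUDAExecutionProvider", "CPUExecutionProvider"]
session = ort.InferenceSession(
    "7B_FT_float16/ONNX/LlamaV2_7B_FT_float16.onnx", providers=providers
)

# Inspect the graph's expected inputs (token IDs, masks, KV caches, etc.)
# before wiring up tokenization and the generation loop.
for inp in session.get_inputs():
    print(inp.name, inp.shape, inp.type)
```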
Quick Start & Requirements
Model weights are fetched as git submodules, one per variant; initialize and update the one you need (e.g., `git submodule init 7B_FT_float16` followed by `git submodule update`). The repository includes a MinimumExample for command-line text completion and a ChatApp using Gradio for a chatbot interface (a minimal sketch of that layer follows below).
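To illustrate the shape of the ChatApp layer, here is a minimal Gradio sketch; `generate_reply` is a hypothetical stand-in for the repository's actual ONNX-backed generation loop, not its API:

```python
# Minimal Gradio chatbot sketch. generate_reply is a hypothetical placeholder
# for an ONNX Runtime generation loop; it is not part of this repository's API.
import gradio as gr

def generate_reply(message, history):
    # A real implementation would tokenize `message`, run the ONNX session
    # autoregressively, and decode the sampled tokens into text.
    return f"(model output for: {message})"

# ChatInterface provides a ready-made chat UI around the reply function.
gr.ChatInterface(generate_reply).launch()
```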
Highlighted Details
Maintenance & Community
Maintained by Microsoft, with links to Meta's Llama 2 model card and responsible use guidelines. No specific community channels (Discord/Slack) are listed.
Licensing & Compatibility
Distributed under the Llama 2 Community License Agreement, which also covers Microsoft's contributions; commercial use is permitted under its terms.
Limitations & Caveats
Access to ONNX model files is gated via a sign-up process. FP16 ONNX performance may be slower than FP32 if the target hardware lacks native FP16 support, leading to runtime casts.
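One way to sidestep the FP16 caveat is to pick the model precision based on the execution providers ONNX Runtime reports at runtime; the model paths below are hypothetical placeholders, assuming both the FP16 and FP32 submodules have been fetched:

```python
# Select FP16 or FP32 weights depending on available execution providers.
# Both model paths are hypothetical placeholders for downloaded submodules.
import onnxruntime as ort

available = ort.get_available_providers()
if "CUDAExecutionProvider" in available:
    # GPUs with native FP16 get the smaller, faster half-precision graph.
    model_path = "7B_FT_float16/ONNX/LlamaV2_7B_FT_float16.onnx"
else:
    # Without native FP16 support, runtime casts can make FP16 slower than FP32.
    model_path = "7B_FT_float32/ONNX/LlamaV2_7B_FT_float32.onnx"

session = ort.InferenceSession(model_path, providers=available)
```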