ONNX-optimized Llama 2 models for research/experimentation
This repository provides optimized ONNX versions of Meta's Llama 2 models, targeting developers and researchers who need efficient inference for large language models. It offers pre-converted ONNX graphs for multiple Llama 2 sizes (7B, 13B) and precisions (FP16, FP32), including fine-tuned variants for dialogue applications, enabling faster deployment and experimentation.
How It Works
The project leverages ONNX Runtime for optimized inference, converting Llama 2's architecture into a graph representation that can be executed efficiently across diverse hardware. Llama 2 itself is a decoder-only transformer; its larger variants use Grouped Query Attention (GQA) for improved efficiency, and its feed-forward layers expand the hidden dimension to roughly 2.7x the model width rather than the conventional 4x of standard transformers.
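As a rough sketch of what that graph execution looks like in practice (the model path here is a hypothetical placeholder, not the repository's documented layout), ONNX Runtime loads the exported graph once and then runs it with whichever execution provider the hardware supports:

```python
# Minimal sketch: load an exported Llama 2 ONNX graph with ONNX Runtime.
# The model path is a hypothetical placeholder for a downloaded submodule file.
import onnxruntime as ort

# ONNX Runtime falls back through this list, so the same code runs on GPU or CPU.
providers = ["CUDAExecutionProvider", "CPUExecutionProvider"]
session = ort.InferenceSession(
    "7B_FT_float16/ONNX/LlamaV2_7B_FT_float16.onnx", providers=providers
)

# Inspect the graph's expected inputs (token IDs, masks, KV caches, etc.)
# before wiring up tokenization and the generation loop.
for inp in session.get_inputs():
    print(inp.name, inp.shape, inp.type)
```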
Quick Start & Requirements
Model weights are fetched as git submodules, one per variant; initialize and update the one you need (e.g., `git submodule init 7B_FT_float16` followed by `git submodule update`). The repository includes a MinimumExample for command-line text completion and a ChatApp using Gradio for a chatbot interface (a minimal sketch of that layer follows below).
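To illustrate the shape of the ChatApp layer, here is a minimal Gradio sketch; `generate_reply` is a hypothetical stand-in for the repository's actual ONNX-backed generation loop, not its API:

```python
# Minimal Gradio chatbot sketch. generate_reply is a hypothetical placeholder
# for an ONNX Runtime generation loop; it is not part of this repository's API.
import gradio as gr

def generate_reply(message, history):
    # A real implementation would tokenize `message`, run the ONNX session
    # autoregressively, and decode the sampled tokens into text.
    return f"(model output for: {message})"

# ChatInterface provides a ready-made chat UI around the reply function.
gr.ChatInterface(generate_reply).launch()
```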
Highlighted Details
Maintenance & Community
Maintained by Microsoft, with links to Meta's Llama 2 model card and responsible use guidelines. No specific community channels (Discord/Slack) are listed.
Licensing & Compatibility
Distributed under the Llama 2 Community License Agreement, which also covers Microsoft's contributions; commercial use is permitted under its terms.
Limitations & Caveats
Access to ONNX model files is gated via a sign-up process. FP16 ONNX performance may be slower than FP32 if the target hardware lacks native FP16 support, leading to runtime casts.
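One way to sidestep the FP16 caveat is to pick the model precision based on the execution providers ONNX Runtime reports at runtime; the model paths below are hypothetical placeholders, assuming both the FP16 and FP32 submodules have been fetched:

```python
# Select FP16 or FP32 weights depending on available execution providers.
# Both model paths are hypothetical placeholders for downloaded submodules.
import onnxruntime as ort

available = ort.get_available_providers()
if "CUDAExecutionProvider" in available:
    # GPUs with native FP16 get the smaller, faster half-precision graph.
    model_path = "7B_FT_float16/ONNX/LlamaV2_7B_FT_float16.onnx"
else:
    # Without native FP16 support, runtime casts can make FP16 slower than FP32.
    model_path = "7B_FT_float32/ONNX/LlamaV2_7B_FT_float32.onnx"

session = ort.InferenceSession(model_path, providers=available)
```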