llama.onnx by tpoisonooo

ONNX models for LLaMa/RWKV, plus quantization and test cases

Created 2 years ago
365 stars

Top 77.1% on SourcePulse

Project Summary

This repository provides ONNX models for LLaMa and RWKV large language models, focusing on quantization and efficient inference. It targets developers and researchers aiming to deploy LLMs on diverse hardware, including resource-constrained devices, by leveraging ONNX for cross-platform compatibility and reduced model size.

How It Works

The project converts LLaMa and RWKV models into the ONNX format, enabling inference without PyTorch or Transformers dependencies. It supports quantization techniques, demonstrated by exporting quantization tables from GPTQ-for-LLaMa, and includes a memory pool that allows inference on systems with as little as 2GB of RAM, albeit at significantly reduced speed. The export pipeline first converts Hugging Face-format checkpoints to ONNX, then optionally casts the graphs to FP16 precision.
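The memory-pool idea can be illustrated with a small, hypothetical sketch (this is not the project's implementation, which manages ONNX tensor buffers internally): keep only a few layer weight blocks resident, load the rest from disk on demand, and evict the least recently used. The repeated disk loads are exactly why low-RAM inference is slow.

```python
from collections import OrderedDict

class WeightPool:
    """Toy LRU pool: keeps at most `capacity` layer weight blocks in RAM.

    `loader(layer_id)` stands in for reading a serialized weight file from
    disk; in llama.onnx the pool would hold ONNX tensor buffers instead.
    """

    def __init__(self, loader, capacity=2):
        self.loader = loader
        self.capacity = capacity
        self.cache = OrderedDict()  # layer_id -> weights, oldest first
        self.disk_loads = 0

    def get(self, layer_id):
        if layer_id in self.cache:
            self.cache.move_to_end(layer_id)  # mark as recently used
            return self.cache[layer_id]
        if len(self.cache) >= self.capacity:
            self.cache.popitem(last=False)    # evict least recently used
        self.disk_loads += 1                  # simulated disk read
        self.cache[layer_id] = self.loader(layer_id)
        return self.cache[layer_id]

# Simulate a 4-layer model on a pool that holds only 2 layers at a time.
pool = WeightPool(loader=lambda i: f"weights-{i}", capacity=2)
for layer in [0, 1, 2, 3, 0, 1]:  # sequential passes defeat the LRU cache
    pool.get(layer)
print(pool.disk_loads)  # prints 6: every access missed, hence the slowdown
```

With a pool smaller than the model, a sequential pass over the layers misses on every access, trading speed for a fixed RAM ceiling.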

Quick Start & Requirements

  • Install dependencies: pip install -r requirements.txt
  • Run LLaMa demo: python3 demo_llama.py ${FP16_ONNX_DIR} "bonjour"
  • Run RWKV demo: python3 demo_rwkv.py ${FP16_ONNX_DIR} "bonjour"
  • Requires Python 3.x.
  • For LLaMa-7B FP16, a minimum of 13GB disk space is needed for the model.
  • Demo scripts require ONNX Runtime.
  • Links: LLaMa GPU inference issue, Model download URLs, LLaMa and RWKV structure comparison
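Under the hood, demo scripts like these run an autoregressive loop: feed the token ids so far, take the argmax of the returned logits, append, and repeat. A minimal sketch of that loop, with a stub callable standing in for the onnxruntime session and tokenizer (both hypothetical here, not the repository's actual API):

```python
def generate(step_fn, prompt_ids, max_new_tokens=8, eos_id=2):
    """Greedy decoding: step_fn(ids) returns next-token logits.

    In the real demos, step_fn would wrap an onnxruntime session running
    the exported LLaMa/RWKV graph; here it is any callable.
    """
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = step_fn(ids)
        next_id = max(range(len(logits)), key=logits.__getitem__)  # argmax
        if next_id == eos_id:
            break
        ids.append(next_id)
    return ids

# Stub "model" over a vocab of 5: always prefers (last_id + 1) mod 5.
def stub_step(ids):
    want = (ids[-1] + 1) % 5
    return [1.0 if tok == want else 0.0 for tok in range(5)]

print(generate(stub_step, [3], max_new_tokens=4, eos_id=99))
# prints [3, 4, 0, 1, 2]
```

The same loop shape applies to both model families; RWKV additionally threads a recurrent state through each step instead of a growing attention cache.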

Highlighted Details

  • Standalone ONNX Runtime demos for LLaMa and RWKV.
  • Memory pool support for low-RAM devices (e.g., 2GB RAM laptops).
  • Quantization table export from GPTQ-for-LLaMa.
  • ONNX models available in FP32 and FP16 precision.
  • Verified ONNX Runtime output against Torch-CUDA with a maximum error of 0.002.
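The FP16 precision trade-off can be sanity-checked in plain Python: `struct` supports the IEEE 754 half-precision format (`'e'`), so one can round-trip values and bound the error FP16 storage introduces. This is a generic illustration of where errors on the order of the reported 0.002 come from, not the project's verification script.

```python
import struct

def to_fp16_and_back(x: float) -> float:
    """Round-trip a float through IEEE 754 half precision."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

# Relative error of FP16 storage for values of different magnitude.
for x in [1.0, 0.1, 3.14159, 1000.5]:
    y = to_fp16_and_back(x)
    print(f"{x:10.5f} -> {y:10.5f}  rel err {abs(x - y) / abs(x):.2e}")

# FP16 has an 11-bit significand, so round-to-nearest keeps the
# relative error of each stored value below 2**-11 (~4.9e-4).
assert all(
    abs(x - to_fp16_and_back(x)) / abs(x) < 2**-11
    for x in [1.0, 0.1, 3.14159, 1000.5]
)
```

Per-value rounding error is tiny, but it compounds through dozens of transformer layers, which is why end-to-end validation against a Torch-CUDA reference is the meaningful check.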

Maintenance & Community

  • Project initiated April 5th.
  • Recent updates include RWKV-4 ONNX models and standalone scripts.
  • No explicit community links (Discord/Slack) or roadmap are provided in the README.

Licensing & Compatibility

  • License: GPLv3.
  • GPLv3 is a strong copyleft license: distributing software that incorporates or links against this code requires releasing it under GPL-compatible terms, which can constrain commercial and closed-source use.

Limitations & Caveats

The project notes that inference on 2GB RAM devices is "very slow." The conversion process for LLaMa ONNX models requires specific versions of the Transformers library and involves multiple manual steps. Mixed-precision kernel optimization is noted as "on the way."

Health Check

  • Last Commit: 2 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 2 stars in the last 30 days

Explore Similar Projects

Starred by Sasha Rush (Research Scientist at Cursor; Professor at Cornell Tech) and Clément Renault (Cofounder of Meilisearch).

lm.rs by samuel-vitorino

Top 0% on SourcePulse
1k stars
Minimal LLM inference in Rust
Created 1 year ago
Updated 10 months ago
Starred by Didier Lopes (Founder of OpenBB), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 5 more.

mlx-lm by ml-explore

Top 26.1% on SourcePulse
2k stars
Python package for LLM text generation and fine-tuning on Apple silicon
Created 6 months ago
Updated 1 day ago
Starred by Junyang Lin (Core Maintainer at Alibaba Qwen), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 3 more.

neural-compressor by intel

Top 0.2% on SourcePulse
2k stars
Python library for model compression (quantization, pruning, distillation, NAS)
Created 5 years ago
Updated 17 hours ago
Starred by Wing Lian (Founder of Axolotl AI) and Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems").

airllm by lyogavin

Top 0.1% on SourcePulse
6k stars
Inference optimization for LLMs on low-resource hardware
Created 2 years ago
Updated 2 weeks ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Jeff Hammerbacher (Cofounder of Cloudera), and 4 more.

gemma_pytorch by google

Top 0.2% on SourcePulse
6k stars
PyTorch implementation for Google's Gemma models
Created 1 year ago
Updated 3 months ago