llama.onnx by tpoisonooo

ONNX models for LLaMa/RWKV, plus quantization and test cases

Created 2 years ago · 362 stars · Top 78.7% on sourcepulse

Project Summary

This repository provides ONNX models for LLaMa and RWKV large language models, focusing on quantization and efficient inference. It targets developers and researchers aiming to deploy LLMs on diverse hardware, including resource-constrained devices, by leveraging ONNX for cross-platform compatibility and reduced model size.

How It Works

The project converts LLaMa and RWKV models into the ONNX format, enabling inference without PyTorch or Transformers dependencies. It supports quantization, demonstrated by exporting quantization tables from GPTQ-for-LLaMa, and includes a memory pool that allows inference on systems with as little as 2GB of RAM, albeit at significantly reduced speed. The export process converts Hugging Face-format checkpoints to ONNX and can then optionally down-convert the result to FP16 precision.
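
As a rough illustration of the FP16 step, the sketch below uses the onnx and onnxconverter-common packages; this is an assumption about tooling rather than the repo's own conversion scripts, and the file paths are placeholders.

    # Hedged sketch: down-convert an exported FP32 ONNX model to FP16.
    # Assumes onnx and onnxconverter-common are installed; the paths
    # are placeholders, not files shipped by this repository.
    import onnx
    from onnxconverter_common import float16

    model_fp32 = onnx.load("llama_fp32.onnx")
    model_fp16 = float16.convert_float_to_float16(model_fp32)
    # Models larger than 2 GB (e.g. LLaMa-7B) must store their weights
    # as external data alongside the .onnx file.
    onnx.save_model(model_fp16, "llama_fp16.onnx", save_as_external_data=True)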

Quick Start & Requirements

  • Install dependencies: pip install -r requirements.txt
  • Run LLaMa demo: python3 demo_llama.py ${FP16_ONNX_DIR} "bonjour"
  • Run RWKV demo: python3 demo_rwkv.py ${FP16_ONNX_DIR} "bonjour"
  • Requires Python 3.x.
  • For LLaMa-7B in FP16, at least 13 GB of disk space is needed for the model.
  • Demo scripts require ONNX Runtime (see the minimal inference sketch after this list).
  • Links: LLaMa GPU inference issue, model download URLs, LLaMa vs. RWKV structure comparison
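
The core of an ONNX Runtime demo loop boils down to something like the following sketch; the model filename, toy token ids, and output layout are hypothetical placeholders, not this repo's actual interface.

    # Hedged sketch of what a demo script does at its core: load an
    # ONNX graph with ONNX Runtime and run one forward pass.
    # "decoder.onnx" and the token ids are placeholders.
    import numpy as np
    import onnxruntime as ort

    sess = ort.InferenceSession("decoder.onnx",
                                providers=["CPUExecutionProvider"])
    input_name = sess.get_inputs()[0].name
    input_ids = np.array([[1, 22557]], dtype=np.int64)  # toy prompt tokens
    # Assumes the first output is logits shaped [batch, seq, vocab].
    logits = sess.run(None, {input_name: input_ids})[0]
    next_token = int(np.argmax(logits[0, -1]))  # greedy next-token pick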

Highlighted Details

  • Standalone ONNX Runtime demos for LLaMa and RWKV.
  • Memory pool support for low-RAM devices (e.g., 2GB RAM laptops).
  • Quantization table export from GPTQ-for-LLaMa.
  • ONNX models available in FP32 and FP16 precision.
  • Verified ONNX Runtime output against Torch-CUDA with a maximum error of 0.002.
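
That parity figure presumably comes from an elementwise comparison of logits produced by the two backends on identical inputs; a minimal sketch of such a check (the saved arrays are hypothetical) is:

    # Hedged sketch of the reported parity check: compare ONNX Runtime
    # logits against Torch-CUDA logits for the same inputs. The .npy
    # files are hypothetical dumps, not artifacts of this repository.
    import numpy as np

    ort_logits = np.load("ort_logits.npy")      # saved ONNX Runtime output
    torch_logits = np.load("torch_logits.npy")  # saved Torch-CUDA output
    max_err = float(np.abs(ort_logits - torch_logits).max())
    print(f"max elementwise error: {max_err}")  # README reports <= 0.002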

Maintenance & Community

  • Project initiated April 5th.
  • Recent updates include RWKV-4 ONNX models and standalone scripts.
  • No explicit community links (Discord/Slack) or roadmap are provided in the README.

Licensing & Compatibility

  • License: GPLv3.
  • GPLv3 is a strong copyleft license; it may restrict commercial use or linking with closed-source software unless its terms are followed.

Limitations & Caveats

The project notes that inference on 2GB RAM devices is "very slow." The conversion process for LLaMa ONNX models requires specific versions of the Transformers library and involves multiple manual steps. Mixed-precision kernel optimization is noted as "on the way."

Health Check

  • Last commit: 2 years ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 1
Star History

  • 1 star in the last 90 days
