llama.onnx by tpoisonooo

ONNX models for LLaMa/RWKV, plus quantization and test cases

Created 2 years ago · 362 stars · Top 78.7% on sourcepulse

Project Summary

This repository provides ONNX models for LLaMa and RWKV large language models, focusing on quantization and efficient inference. It targets developers and researchers aiming to deploy LLMs on diverse hardware, including resource-constrained devices, by leveraging ONNX for cross-platform compatibility and reduced model size.

How It Works

The project converts LLaMa and RWKV models into the ONNX format, enabling inference without PyTorch or Transformers dependencies. It supports quantization, demonstrated by exporting quantization tables from GPTQ-for-LLaMa, and includes a memory pool that allows inference on systems with as little as 2GB of RAM, albeit at significantly reduced speed. The export process converts Hugging Face-format checkpoints to ONNX and can then optionally down-convert the result to FP16 precision.
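
As a rough illustration of the FP16 step, the sketch below uses the onnx and onnxconverter-common packages; this is an assumption about tooling rather than the repo's own conversion scripts, and the file paths are placeholders.

    # Hedged sketch: down-convert an exported FP32 ONNX model to FP16.
    # Assumes onnx and onnxconverter-common are installed; the paths
    # are placeholders, not files shipped by this repository.
    import onnx
    from onnxconverter_common import float16

    model_fp32 = onnx.load("llama_fp32.onnx")
    model_fp16 = float16.convert_float_to_float16(model_fp32)
    # Models larger than 2 GB (e.g. LLaMa-7B) must store their weights
    # as external data alongside the .onnx file.
    onnx.save_model(model_fp16, "llama_fp16.onnx", save_as_external_data=True)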

Quick Start & Requirements

  • Install dependencies: pip install -r requirements.txt
  • Run LLaMa demo: python3 demo_llama.py ${FP16_ONNX_DIR} "bonjour"
  • Run RWKV demo: python3 demo_rwkv.py ${FP16_ONNX_DIR} "bonjour"
  • Requires Python 3.x.
  • For LLaMa-7B in FP16, at least 13 GB of disk space is needed for the model.
  • Demo scripts require ONNX Runtime (see the minimal inference sketch after this list).
  • Links: LLaMa GPU inference issue, model download URLs, LLaMa vs. RWKV structure comparison
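
The core of an ONNX Runtime demo loop boils down to something like the following sketch; the model filename, toy token ids, and output layout are hypothetical placeholders, not this repo's actual interface.

    # Hedged sketch of what a demo script does at its core: load an
    # ONNX graph with ONNX Runtime and run one forward pass.
    # "decoder.onnx" and the token ids are placeholders.
    import numpy as np
    import onnxruntime as ort

    sess = ort.InferenceSession("decoder.onnx",
                                providers=["CPUExecutionProvider"])
    input_name = sess.get_inputs()[0].name
    input_ids = np.array([[1, 22557]], dtype=np.int64)  # toy prompt tokens
    # Assumes the first output is logits shaped [batch, seq, vocab].
    logits = sess.run(None, {input_name: input_ids})[0]
    next_token = int(np.argmax(logits[0, -1]))  # greedy next-token pick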

Highlighted Details

  • Standalone ONNX Runtime demos for LLaMa and RWKV.
  • Memory pool support for low-RAM devices (e.g., 2GB RAM laptops).
  • Quantization table export from GPTQ-for-LLaMa.
  • ONNX models available in FP32 and FP16 precision.
  • Verified ONNX Runtime output against Torch-CUDA with a maximum error of 0.002.
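
That parity figure presumably comes from an elementwise comparison of logits produced by the two backends on identical inputs; a minimal sketch of such a check (the saved arrays are hypothetical) is:

    # Hedged sketch of the reported parity check: compare ONNX Runtime
    # logits against Torch-CUDA logits for the same inputs. The .npy
    # files are hypothetical dumps, not artifacts of this repository.
    import numpy as np

    ort_logits = np.load("ort_logits.npy")      # saved ONNX Runtime output
    torch_logits = np.load("torch_logits.npy")  # saved Torch-CUDA output
    max_err = float(np.abs(ort_logits - torch_logits).max())
    print(f"max elementwise error: {max_err}")  # README reports <= 0.002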

Maintenance & Community

  • Project initiated April 5th.
  • Recent updates include RWKV-4 ONNX models and standalone scripts.
  • No explicit community links (Discord/Slack) or roadmap are provided in the README.

Licensing & Compatibility

  • License: GPLv3.
  • GPLv3 is a strong copyleft license; it may restrict commercial use or linking with closed-source software unless its terms are followed.

Limitations & Caveats

The project notes that inference on 2GB RAM devices is "very slow." The conversion process for LLaMa ONNX models requires specific versions of the Transformers library and involves multiple manual steps. Mixed-precision kernel optimization is noted as "on the way."

Health Check

  • Last commit: 2 years ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 1
Star History

  • 1 star in the last 90 days
