ONNX models for LLaMa/RWKV, plus quantization and test cases
This repository provides ONNX models for LLaMa and RWKV large language models, focusing on quantization and efficient inference. It targets developers and researchers aiming to deploy LLMs on diverse hardware, including resource-constrained devices, by leveraging ONNX for cross-platform compatibility and reduced model size.
How It Works
The project converts LLaMa and RWKV models to the ONNX format, enabling inference without PyTorch or Transformers dependencies. Quantization is supported via tables exported from GPTQ-for-LLaMa, and a memory pool allows inference on systems with as little as 2 GB of RAM, albeit at significantly reduced speed. Export starts from Hugging Face format checkpoints, with an optional follow-up conversion to FP16 precision.
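To illustrate what dependency-free inference looks like, here is a minimal sketch that loads an exported model with onnxruntime directly. The model path, input names, and token ids below are placeholders, not the repository's actual layout; the bundled demo_llama.py / demo_rwkv.py scripts handle the real prompt handling, attention masks, and key/value caches.

import numpy as np
import onnxruntime as ort

# Hypothetical path for illustration only; point this at an exported ONNX file.
MODEL_PATH = "fp16_onnx/decoder.onnx"

# CPU-only session: no PyTorch or Transformers needed at inference time.
sess = ort.InferenceSession(MODEL_PATH, providers=["CPUExecutionProvider"])

# Inspect the exported graph to see which tensors it expects.
for inp in sess.get_inputs():
    print(inp.name, inp.shape, inp.type)

# Feed a dummy prompt of token ids (real usage requires the model's tokenizer).
input_ids = np.array([[1, 15043]], dtype=np.int64)
outputs = sess.run(None, {sess.get_inputs()[0].name: input_ids})
print(outputs[0].shape)  # e.g. logits over the vocabulary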
Quick Start & Requirements
pip install -r requirements.txt
python3 demo_llama.py ${FP16_ONNX_DIR} "bonjour"
python3 demo_rwkv.py ${FP16_ONNX_DIR} "bonjour"
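Here, ${FP16_ONNX_DIR} is the directory holding the FP16 ONNX files produced by the export step. If you need to run the optional FP16 conversion yourself, a sketch using onnx and onnxconverter-common follows; the file names are assumptions, and the repository may ship its own conversion script.

import onnx
from onnxconverter_common import float16

# Hypothetical file names; adjust to the directory produced by the export step.
model = onnx.load("fp32_onnx/decoder.onnx")

# Convert initializers and intermediate tensors to FP16 while keeping the
# graph inputs/outputs in FP32 so callers can feed the model unchanged.
model_fp16 = float16.convert_float_to_float16(model, keep_io_types=True)

onnx.save(model_fp16, "fp16_onnx/decoder.onnx")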
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The project notes that inference on 2GB RAM devices is "very slow." The conversion process for LLaMa ONNX models requires specific versions of the Transformers library and involves multiple manual steps. Mixed-precision kernel optimization is noted as "on the way."