Hacked LLaMA version for single consumer-grade GPU inference
This repository provides a modified implementation of Meta's LLaMA large language model, optimized for efficient execution on consumer-grade GPUs, including single 4GB VRAM devices. It targets researchers and developers needing to run LLMs locally with reduced hardware requirements, offering quantization and multi-GPU support.
How It Works
The core innovation lies in its quantization techniques (2, 3, 4, and 8-bit) applied to LLaMA models, significantly reducing memory footprint and enabling inference on lower-spec hardware. It leverages GPTQ for quantization and supports various calibration datasets (wikitext2, ptb, c4) for this process. The project also integrates with Hugging Face's accelerate for multi-GPU inference and provides utilities for model conversion and serving via Gradio or Flask.
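As a rough sketch, the GPTQ quantization and quantized-inference workflow looks something like the following; the llama.llama_quant module, the quant_infer.py script, and the exact flags are taken from the upstream README as assumptions and may differ between versions, so verify against the repository before running:

```bash
# Quantize a Hugging Face LLaMA checkpoint to 4-bit with GPTQ, using
# c4 for calibration (wikitext2 and ptb are also supported).
# Module name and flags are assumptions; check the repo's README.
python -m llama.llama_quant decapoda-research/llama-7b-hf c4 \
    --wbits 4 --save pyllama-7B4b.pt

# Run single-GPU inference against the 4-bit checkpoint.
python quant_infer.py --wbits 4 --load pyllama-7B4b.pt \
    --text "The meaning of life is" --max_length 24 --cuda cuda:0
```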
Quick Start & Requirements
Install with pip install pyllama -U; GPTQ quantization additionally requires pip install gptq. Downloading converted checkpoints from the Hugging Face Hub also requires setting the HUGGING_FACE_HUB_TOKEN environment variable.
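As an illustration of the community download path, the following sketch assumes the llama.download module and flags described in the upstream README (the exact interface may have changed); the Hugging Face token is only needed when pulling Hub-hosted checkpoints:

```bash
# Community way to fetch LLaMA weights (7B shown); the module name and
# flags are assumptions from the upstream README.
python -m llama.download --model_size 7B --folder pyllama_data

# Required when downloading converted checkpoints from the Hugging Face Hub.
export HUGGING_FACE_HUB_TOKEN=hf_xxx  # replace with your own token
```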
Highlighted Details
Maintenance & Community
The project is associated with the original LLaMA research. Community download methods and quantization scripts are provided.
Licensing & Compatibility
The project's license is not explicitly stated in the README, but it is derived from Meta's LLaMA, whose original weights were released under a non-commercial research license. Commercial use should therefore not be assumed to be permitted.
Limitations & Caveats
Access to the original LLaMA model weights requires approval via a Google Form, which may impose restrictions. The README's references to a "hacked version" and a "community way" of downloading suggest potential licensing ambiguities or reliance on unofficial sources.