pyllama  by henrywoo

Hacked LLaMA version for single consumer-grade GPU inference

Created 2 years ago
2,803 stars

Top 17.0% on SourcePulse

Project Summary

This repository provides a modified implementation of Meta's LLaMA large language model, optimized for efficient inference on consumer-grade GPUs, down to a single card with 4GB of VRAM. It targets researchers and developers who need to run LLMs locally on reduced hardware, offering low-bit quantization and multi-GPU inference.
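
The multi-GPU path mentioned above is Hugging Face accelerate's automatic layer sharding. A minimal sketch using the standard transformers API (illustrative only, not pyllama's own entry points; the checkpoint name is an assumed community upload):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "decapoda-research/llama-7b-hf"  # assumed community checkpoint
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    # device_map="auto" lets accelerate split layers across all visible
    # GPUs (spilling to CPU RAM if needed) without manual placement.
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.float16, device_map="auto"
    )

    inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=16)
    print(tokenizer.decode(out[0], skip_special_tokens=True))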

How It Works

The core innovation is low-bit quantization (2-, 3-, 4-, and 8-bit) of LLaMA weights, which sharply reduces the memory footprint and enables inference on lower-spec hardware. Quantization uses GPTQ with a choice of calibration datasets (wikitext2, ptb, c4). The project also integrates Hugging Face's accelerate for multi-GPU inference and provides utilities for model conversion and for serving via Gradio or Flask.
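
GPTQ chooses rounding that minimizes each layer's output error on the calibration data, but the memory savings themselves come from storing weights as low-bit integers plus per-group scales. A toy round-trip sketch of group-wise 4-bit quantization (illustrative only, not the project's code; group size 128 mirrors the setting highlighted below):

    import torch

    def quantize_groupwise(w, bits=4, group_size=128):
        # Uniform asymmetric quantization, one scale/offset per group.
        g = w.reshape(-1, group_size)
        lo = g.min(dim=1, keepdim=True).values
        hi = g.max(dim=1, keepdim=True).values
        scale = (hi - lo) / (2**bits - 1)          # step size per group
        q = torch.round((g - lo) / scale).clamp(0, 2**bits - 1)
        w_hat = (q * scale + lo).reshape(w.shape)  # dequantized weights
        return q.to(torch.uint8), scale, lo, w_hat

    w = torch.randn(4096, 4096)                    # a LLaMA-sized matrix
    q, scale, lo, w_hat = quantize_groupwise(w)
    print("max abs error:", (w - w_hat).abs().max().item())

At 4 bits per weight plus an fp16 scale and offset per 128-weight group, storage drops to roughly 4.25 bits per weight, close to a 3.8x reduction from fp16.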

Quick Start & Requirements

  • Install: pip install pyllama -U
  • Prerequisites: PyTorch with CUDA support.
  • Model Download: Requires completing a Google Form for official access, or using the provided BitTorrent link / community download script. All model sizes together occupy 219GB of disk space.
  • Quantization: Requires pip install gptq and setting HUGGING_FACE_HUB_TOKEN.
  • Resources: Quantizing a 65B model to 4-bit took ~2.7 hours and shrank the checkpoint from 122GB to 32GB; a 7B model can run in as little as 3.2GB of VRAM with 2-bit quantization (the arithmetic behind these numbers is sketched after this list).
  • Links: Official LLaMA, Hugging Face LLaMA
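
The quoted sizes follow from bits-per-parameter arithmetic. A quick check (parameter counts are nominal; real quantized files add per-group scales and metadata):

    GIB = 2**30

    def size_gib(params, bits):
        return params * bits / 8 / GIB

    for params, name in [(6.7e9, "7B"), (65.2e9, "65B")]:
        for bits in (16, 8, 4, 2):
            print(f"{name} @ {bits}-bit: {size_gib(params, bits):6.1f} GiB")

A 65B model at 16-bit is ~121 GiB (the quoted 122GB) versus ~30 GiB at 4-bit (the quoted 32GB); 7B weights at 2-bit are ~1.6 GiB, with activations, the KV cache, and CUDA overhead accounting for the rest of the quoted 3.2GB VRAM figure.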

Highlighted Details

  • Enables LLaMA inference on GPUs with as little as 4GB VRAM via 2-bit quantization.
  • Supports 4-bit quantization with a group size of 128 for improved stability and performance.
  • Offers utilities for converting models between Hugging Face and original Facebook formats.
  • Includes Gradio and Flask web UIs for easy deployment (a minimal serving sketch follows this list).
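
The exact entry points for the bundled web UIs are project-specific, but a minimal Gradio wrapper around any generation function looks like the following (generate_fn is a hypothetical stand-in for the quantized-model call):

    import gradio as gr

    def generate_fn(prompt: str) -> str:
        # Hypothetical stand-in: tokenize `prompt`, run the quantized
        # LLaMA model's generate(), and decode the completion here.
        return f"(model output for: {prompt})"

    demo = gr.Interface(
        fn=generate_fn,
        inputs=gr.Textbox(lines=4, label="Prompt"),
        outputs=gr.Textbox(label="Completion"),
        title="pyllama demo (sketch)",
    )

    demo.launch()  # serves a local web UI, on port 7860 by default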

Maintenance & Community

The project derives from Meta's original LLaMA research release. Community download methods and quantization scripts are provided in the repository.

Licensing & Compatibility

The README does not explicitly state a license for the project itself. It builds on Meta's LLaMA weights, which were released under a non-commercial license, so suitability for commercial use should not be assumed.

Limitations & Caveats

Access to the original LLaMA weights requires approval via a Google Form, which may come with restrictions. The README's references to a "hacked version" and a "community way" to download suggest licensing ambiguity and reliance on unofficial weight sources.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 3 stars in the last 30 days

Explore Similar Projects

parallelformers by tunib-ai: Toolkit for easy model parallelization. 790 stars; created 4 years ago, updated 2 years ago. Starred by Luca Soldaini (Research Scientist at Ai2), Edward Sun (Research Scientist at Meta Superintelligence Lab), and 4 more.

gemma_pytorch by google: PyTorch implementation for Google's Gemma models. 6k stars; created 1 year ago, updated 3 months ago. Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Jeff Hammerbacher (Cofounder of Cloudera), and 4 more.