pyllama by henrywoo

Hacked LLaMA version for single consumer-grade GPU inference

created 2 years ago
2,800 stars

Top 17.4% on sourcepulse

Project Summary

This repository provides a modified implementation of Meta's LLaMA large language model, optimized for efficient execution on consumer-grade GPUs, including single cards with as little as 4GB of VRAM. It targets researchers and developers who need to run LLMs locally with reduced hardware requirements, offering quantization and multi-GPU support.

How It Works

The core innovation lies in its quantization techniques (2-, 3-, 4-, and 8-bit) applied to LLaMA models, significantly reducing the memory footprint and enabling inference on lower-spec hardware. It leverages GPTQ for quantization and supports several calibration datasets (wikitext2, ptb, c4) for this process. The project also integrates with Hugging Face Accelerate for multi-GPU inference and provides utilities for model conversion and for serving via Gradio or Flask.
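
For illustration, quantization is driven from the command line. The sketch below follows the workflow the README describes, but the model ID, calibration dataset, and exact flags are assumptions to verify against the current repo:

  # 4-bit GPTQ quantization of a Hugging Face LLaMA checkpoint
  # (model ID and flags are illustrative; needs `pip install gptq` and a
  # HUGGING_FACE_HUB_TOKEN if the weights are gated)
  export HUGGING_FACE_HUB_TOKEN=<your token>
  python -m llama.llama_quant decapoda-research/llama-7b-hf c4 \
      --wbits 4 --groupsize 128 --save pyllama-7B4b.pt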

Quick Start & Requirements

  • Install: pip install pyllama -U
  • Prerequisites: PyTorch with CUDA support.
  • Model Download: Requires filling out a Google Form for official access, or using the provided BitTorrent link/community download script (see the command sketch after this list). Total disk space for all models is 219GB.
  • Quantization: Requires pip install gptq and setting HUGGING_FACE_HUB_TOKEN.
  • Resources: Quantizing a 65B model to 4-bit took ~2.7 hours and reduced size from 122GB to 32GB. Inference on a 7B model is possible with as little as 3.2GB VRAM (2-bit quantization).
  • Links: Official LLaMA, Hugging Face LLaMA
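
As referenced above, a minimal download-and-infer session might look like the following. The entry points mirror those the README describes, but the folder layout, script names, and flags are assumptions to check against the repo:

  # Community download of the 7B weights (llama.download entry point;
  # target folder is an assumption)
  python -m llama.download --model_size 7B --folder pyllama_data

  # Single-GPU inference from a 4-bit quantized checkpoint
  # (script name and flags are illustrative)
  python quant_infer.py --wbits 4 --load pyllama-7B4b.pt \
      --text "The meaning of life is" --max_length 24 --cuda cuda:0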

Highlighted Details

  • Enables LLaMA inference on GPUs with as little as 4GB VRAM via 2-bit quantization.
  • Supports 4-bit quantization with a group size of 128 for improved stability and performance.
  • Offers utilities for converting models between Hugging Face and original Facebook formats.
  • Includes Gradio and Flask web UIs for easy deployment (see the launch sketch below).
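
The web UIs are started from launcher scripts shipped with the repo. The invocation below is a hypothetical example (the script name and arguments are assumptions; check the repository for the actual launchers):

  # Launch a single-GPU Gradio chat UI (hypothetical script name and flags)
  python webapp_single.py --ckpt_dir pyllama_data/7B \
      --tokenizer_path pyllama_data/tokenizer.model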

Maintenance & Community

The project builds on Meta's original LLaMA research release. Community download methods and quantization scripts are provided.

Licensing & Compatibility

The project's license is not explicitly stated in the README. Because it is based on Meta's LLaMA, whose weights are distributed under a non-commercial license, suitability for commercial use should not be assumed.

Limitations & Caveats

Access to original LLaMA model weights requires approval via a Google Form, which may have restrictions. The README mentions a "hacked version" and "community way" for downloads, suggesting potential licensing ambiguities or reliance on unofficial sources.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 5 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Georgios Konstantopoulos (CTO, General Partner at Paradigm), and 2 more.

GPTQ-for-LLaMa by qwopqwop200

0.0% · 3k stars
4-bit quantization for LLaMA models using GPTQ
created 2 years ago · updated 1 year ago
Starred by Tobi Lutke (Cofounder of Shopify), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 10 more.

qlora by artidoro

0.2% · 11k stars
Finetuning tool for quantized LLMs
created 2 years ago · updated 1 year ago