pyllama  by henrywoo

Hacked LLaMA version for single consumer-grade GPU inference

Created 2 years ago
2,803 stars

Top 17.0% on SourcePulse

Project Summary

This repository provides a modified implementation of Meta's LLaMA large language model, optimized for efficient inference on consumer-grade GPUs, down to a single card with 4GB of VRAM. It targets researchers and developers who need to run LLMs locally on reduced hardware, offering low-bit quantization and multi-GPU inference.
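
The multi-GPU path mentioned above is Hugging Face accelerate's automatic layer sharding. A minimal sketch using the standard transformers API (illustrative only, not pyllama's own entry points; the checkpoint name is an assumed community upload):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "decapoda-research/llama-7b-hf"  # assumed community checkpoint
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    # device_map="auto" lets accelerate split layers across all visible
    # GPUs (spilling to CPU RAM if needed) without manual placement.
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.float16, device_map="auto"
    )

    inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=16)
    print(tokenizer.decode(out[0], skip_special_tokens=True))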

How It Works

The core innovation is low-bit quantization (2-, 3-, 4-, and 8-bit) of LLaMA weights, which sharply reduces the memory footprint and enables inference on lower-spec hardware. Quantization uses GPTQ with a choice of calibration datasets (wikitext2, ptb, c4). The project also integrates Hugging Face's accelerate for multi-GPU inference and provides utilities for model conversion and for serving via Gradio or Flask.
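
GPTQ chooses rounding that minimizes each layer's output error on the calibration data, but the memory savings themselves come from storing weights as low-bit integers plus per-group scales. A toy round-trip sketch of group-wise 4-bit quantization (illustrative only, not the project's code; group size 128 mirrors the setting highlighted below):

    import torch

    def quantize_groupwise(w, bits=4, group_size=128):
        # Uniform asymmetric quantization, one scale/offset per group.
        g = w.reshape(-1, group_size)
        lo = g.min(dim=1, keepdim=True).values
        hi = g.max(dim=1, keepdim=True).values
        scale = (hi - lo) / (2**bits - 1)          # step size per group
        q = torch.round((g - lo) / scale).clamp(0, 2**bits - 1)
        w_hat = (q * scale + lo).reshape(w.shape)  # dequantized weights
        return q.to(torch.uint8), scale, lo, w_hat

    w = torch.randn(4096, 4096)                    # a LLaMA-sized matrix
    q, scale, lo, w_hat = quantize_groupwise(w)
    print("max abs error:", (w - w_hat).abs().max().item())

At 4 bits per weight plus an fp16 scale and offset per 128-weight group, storage drops to roughly 4.25 bits per weight, close to a 3.8x reduction from fp16.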

Quick Start & Requirements

  • Install: pip install pyllama -U
  • Prerequisites: PyTorch with CUDA support.
  • Model Download: Requires completing a Google Form for official access, or using the provided BitTorrent link / community download script. All model sizes together occupy 219GB of disk space.
  • Quantization: Requires pip install gptq and setting HUGGING_FACE_HUB_TOKEN.
  • Resources: Quantizing a 65B model to 4-bit took ~2.7 hours and shrank the checkpoint from 122GB to 32GB; a 7B model can run in as little as 3.2GB of VRAM with 2-bit quantization (the arithmetic behind these numbers is sketched after this list).
  • Links: Official LLaMA, Hugging Face LLaMA
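
The quoted sizes follow from bits-per-parameter arithmetic. A quick check (parameter counts are nominal; real quantized files add per-group scales and metadata):

    GIB = 2**30

    def size_gib(params, bits):
        return params * bits / 8 / GIB

    for params, name in [(6.7e9, "7B"), (65.2e9, "65B")]:
        for bits in (16, 8, 4, 2):
            print(f"{name} @ {bits}-bit: {size_gib(params, bits):6.1f} GiB")

A 65B model at 16-bit is ~121 GiB (the quoted 122GB) versus ~30 GiB at 4-bit (the quoted 32GB); 7B weights at 2-bit are ~1.6 GiB, with activations, the KV cache, and CUDA overhead accounting for the rest of the quoted 3.2GB VRAM figure.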

Highlighted Details

  • Enables LLaMA inference on GPUs with as little as 4GB VRAM via 2-bit quantization.
  • Supports 4-bit quantization with a group size of 128 for improved stability and performance.
  • Offers utilities for converting models between Hugging Face and original Facebook formats.
  • Includes Gradio and Flask web UIs for easy deployment (a minimal serving sketch follows this list).
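
The exact entry points for the bundled web UIs are project-specific, but a minimal Gradio wrapper around any generation function looks like the following (generate_fn is a hypothetical stand-in for the quantized-model call):

    import gradio as gr

    def generate_fn(prompt: str) -> str:
        # Hypothetical stand-in: tokenize `prompt`, run the quantized
        # LLaMA model's generate(), and decode the completion here.
        return f"(model output for: {prompt})"

    demo = gr.Interface(
        fn=generate_fn,
        inputs=gr.Textbox(lines=4, label="Prompt"),
        outputs=gr.Textbox(label="Completion"),
        title="pyllama demo (sketch)",
    )

    demo.launch()  # serves a local web UI, on port 7860 by default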

Maintenance & Community

The project derives from Meta's original LLaMA research release. Community download methods and quantization scripts are provided in the repository.

Licensing & Compatibility

The README does not explicitly state a license for the project itself. It builds on Meta's LLaMA weights, which were released under a non-commercial license, so suitability for commercial use should not be assumed.

Limitations & Caveats

Access to the original LLaMA weights requires approval via a Google Form, which may come with restrictions. The README's references to a "hacked version" and a "community way" to download suggest licensing ambiguity and reliance on unofficial weight sources.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 3 stars in the last 30 days

Explore Similar Projects

parallelformers by tunib-ai: Toolkit for easy model parallelization. 790 stars; created 4 years ago, updated 2 years ago. Starred by Luca Soldaini (Research Scientist at Ai2), Edward Sun (Research Scientist at Meta Superintelligence Lab), and 4 more.

gemma_pytorch by google: PyTorch implementation for Google's Gemma models. 6k stars; created 1 year ago, updated 3 months ago. Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Jeff Hammerbacher (Cofounder of Cloudera), and 4 more.