gpt-fast by meta-pytorch

PyTorch text generation for efficient transformer inference

Created 1 year ago
6,094 stars

Top 8.5% on SourcePulse

Project Summary

This repository provides a highly efficient, PyTorch-native implementation for transformer text generation, targeting researchers and power users seeking maximum performance with minimal code. It achieves very low latency and high throughput for models like LLaMA and Mixtral using techniques such as int8/int4 quantization and speculative decoding, all within approximately 1000 lines of Python.

How It Works

The core approach leverages native PyTorch features and optimizations to deliver performance without external frameworks. Key techniques include int8 and int4 weight-only quantization for reduced memory footprint and faster computation, speculative decoding for improved generation speed by using a smaller draft model, and tensor parallelism for distributing model computations across multiple GPUs. This native PyTorch implementation aims for simplicity and direct control over the generation process.
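
To make the weight-only quantization idea concrete, here is a minimal pure-Python sketch of symmetric per-row int8 quantization. This is an illustration of the technique only, not gpt-fast's implementation, which operates on PyTorch tensors with fused dequantize-matmul kernels; the function names are hypothetical.

```python
# Illustrative sketch of symmetric per-row int8 weight-only quantization.
# Not gpt-fast's code: the real implementation works on PyTorch tensors.

def quantize_row(weights):
    """Quantize one weight row to int8 with a single per-row scale."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_row(q, scale):
    """Recover approximate floating-point weights at compute time."""
    return [v * scale for v in q]

row = [0.5, -1.27, 0.03, 1.0]
q, scale = quantize_row(row)
approx = dequantize_row(q, scale)
max_err = max(abs(a - b) for a, b in zip(row, approx))
assert max_err <= scale / 2 + 1e-9  # error bounded by half a quantization step
```

Storing weights as int8 (or packed int4) cuts memory traffic, which matters because small-batch decoding is memory-bandwidth bound; the scale per row (or per group, for int4) bounds the reconstruction error.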

Quick Start & Requirements

  • Install: pip install -r requirements.txt (after installing PyTorch nightly).
  • Prerequisites: PyTorch nightly, sentencepiece. Supports NVIDIA and AMD GPUs.
  • Model Conversion: Use ./scripts/prepare.sh <MODEL_REPO> to convert Hugging Face models.
  • Example Generation: python generate.py --compile --checkpoint_path checkpoints/$MODEL_REPO/model.pth --prompt "Hello, my name is"
  • Documentation: Blog post walkthrough available.

Highlighted Details

  • Achieves high tokens/second on LLaMA and Mixtral models, with significant speedups via 8-bit and 4-bit quantization.
  • Demonstrates strong performance with tensor parallelism, scaling up to 8 GPUs.
  • Supports speculative decoding with a draft model for further latency reduction.
  • Includes a pure PyTorch implementation of GPTQ quantization.
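
The speculative decoding flow can be sketched as follows. This is a simplified greedy variant with toy stand-in "models" (the real method, as used in gpt-fast, verifies draft tokens probabilistically against the target model's distribution and runs real networks); all names here are hypothetical.

```python
# Toy sketch of greedy speculative decoding. The draft model proposes k
# tokens cheaply; the target model verifies them, accepting matches and
# correcting the first mismatch. The "models" are stand-in functions.

def draft_model(ctx):
    # Hypothetical fast draft model: repeats the last token.
    return ctx[-1]

def target_model(ctx):
    # Hypothetical slow target model: running sum of tokens modulo 10.
    return sum(ctx) % 10

def speculative_step(ctx, k=4):
    """Draft k tokens, then verify each against the target model."""
    draft, d_ctx = [], list(ctx)
    for _ in range(k):
        t = draft_model(d_ctx)
        draft.append(t)
        d_ctx.append(t)

    accepted, v_ctx = [], list(ctx)
    for t in draft:
        expected = target_model(v_ctx)   # in practice: one batched forward
        if t == expected:
            accepted.append(t)
            v_ctx.append(t)
        else:
            accepted.append(expected)    # fall back to the target's token
            break
    return accepted

print(speculative_step([0, 0], k=4))  # → [0, 0, 0, 0]  (all drafts accepted)
print(speculative_step([3, 3], k=4))  # → [6]  (draft rejected immediately)
```

When drafts are frequently accepted, k tokens cost roughly one target-model pass instead of k sequential passes, which is where the latency reduction comes from.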

Maintenance & Community

The project is maintained under the meta-pytorch GitHub organization (formerly pytorch-labs) and acknowledges contributions and inspiration from Lightning AI, GGML, Karpathy, and MLC-LLM. Community projects inspired by gpt-fast include gpt-blazing, gptfast, and gpt-accelera.

Licensing & Compatibility

Released under the BSD 3-Clause license, which permits commercial use and modification with attribution.

Limitations & Caveats

The project explicitly states it is not intended as a framework or library, encouraging direct code reuse. Generative tasks are not currently supported for evaluation via eval.py. Benchmarks are run with a batch size of 1 and short prompts, which may not reflect performance in all scenarios.

Health Check

  • Last Commit: 3 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 2
  • Issues (30d): 0
  • Star History: 52 stars in the last 30 days

Explore Similar Projects

Starred by Luca Soldaini (Research Scientist at Ai2), Edward Sun (Research Scientist at Meta Superintelligence Lab), and 4 more.

parallelformers by tunib-ai

0%
790
Toolkit for easy model parallelization
Created 4 years ago
Updated 2 years ago
Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), and 8 more.

EAGLE by SafeAILab

10.6%
2k
Speculative decoding research paper for faster LLM inference
Created 1 year ago
Updated 1 week ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Jeff Hammerbacher (Cofounder of Cloudera), and 4 more.

gemma_pytorch by google

0.2%
6k
PyTorch implementation for Google's Gemma models
Created 1 year ago
Updated 3 months ago
Starred by Nat Friedman (Former CEO of GitHub), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 15 more.

FasterTransformer by NVIDIA

0.1%
6k
Optimized transformer library for inference
Created 4 years ago
Updated 1 year ago
Starred by Alex Yu (Research Scientist at OpenAI; Former Cofounder of Luma AI), Lianmin Zheng (Coauthor of SGLang, vLLM), and 2 more.

HunyuanVideo by Tencent-Hunyuan

0.2%
11k
PyTorch code for video generation research
Created 9 months ago
Updated 3 weeks ago