gpt-fast by pytorch-labs

PyTorch-native text generation for efficient transformer inference

Created 1 year ago · 6,040 stars · Top 8.7% on sourcepulse

View on GitHub
Project Summary

This repository provides a highly efficient, PyTorch-native implementation for transformer text generation, targeting researchers and power users seeking maximum performance with minimal code. It achieves very low latency and high throughput for models like LLaMA and Mixtral using techniques such as int8/int4 quantization and speculative decoding, all within approximately 1000 lines of Python.

How It Works

The core approach leverages native PyTorch features, including torch.compile, to deliver this performance without external frameworks. Key techniques include int8 and int4 weight-only quantization, which shrinks the weight memory footprint and, because autoregressive decoding is memory-bandwidth-bound, speeds up each generated token; speculative decoding, which uses a smaller draft model to propose tokens that the full model verifies cheaply; and tensor parallelism, which distributes the model's computation across multiple GPUs. This native PyTorch implementation aims for simplicity and direct control over the generation process.
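
For intuition, the following is a minimal sketch of weight-only int8 quantization in plain PyTorch. It is a hypothetical illustration, not the repository's actual quantize.py: weights are stored per output channel in int8 with float scales and dequantized on the fly inside the matmul.

    import torch
    import torch.nn.functional as F

    def quantize_int8(weight):
        # Per-output-channel absmax scaling: int8 storage + float scales.
        scales = weight.abs().amax(dim=1, keepdim=True) / 127.0
        q = torch.round(weight / scales).to(torch.int8)
        return q, scales

    class Int8Linear(torch.nn.Module):
        def __init__(self, linear):
            super().__init__()
            q, s = quantize_int8(linear.weight.data)
            self.register_buffer("qweight", q)
            self.register_buffer("scales", s)

        def forward(self, x):
            # Dequantize on the fly. Decoding is memory-bandwidth-bound,
            # so halving the weight bytes read per token speeds up generation.
            w = self.qweight.to(x.dtype) * self.scales.to(x.dtype)
            return F.linear(x, w)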

Quick Start & Requirements

  • Install: pip install -r requirements.txt (after installing PyTorch nightly).
  • Prerequisites: PyTorch nightly, sentencepiece. Supports NVIDIA and AMD GPUs.
  • Model Conversion: Use ./scripts/prepare.sh <MODEL_REPO> to convert Hugging Face models.
  • Example Generation: python generate.py --compile --checkpoint_path checkpoints/$MODEL_REPO/model.pth --prompt "Hello, my name is"
  • Documentation: the PyTorch blog post "Accelerating Generative AI with PyTorch II: GPT, Fast" walks through the implementation.
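  • Quantization (commands per the upstream README; verify against your checkout): python quantize.py --checkpoint_path checkpoints/$MODEL_REPO/model.pth --mode int8, then pass the resulting model_int8.pth to generate.py.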

Highlighted Details

  • Achieves high tokens-per-second generation on LLaMA and Mixtral models, with significant further speedups from int8 and int4 weight-only quantization.
  • Demonstrates strong performance with tensor parallelism, scaling up to 8 GPUs.
  • Supports speculative decoding with a smaller draft model for further latency reduction (a minimal sketch follows this list).
  • Includes a pure PyTorch implementation of GPTQ quantization.
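
The sketch below illustrates the greedy variant of speculative decoding referenced above. It is a hypothetical toy, not the repository's generate.py: target and draft are assumed to be callables returning logits for every position (a real implementation would use KV caches). The draft model proposes k tokens cheaply, and the target model verifies all of them in a single forward pass.

    import torch

    @torch.no_grad()
    def speculative_step(target, draft, tokens, k=4):
        # 1) Draft model proposes k tokens autoregressively (cheap).
        proposal = tokens
        for _ in range(k):
            logits = draft(proposal)
            nxt = logits[:, -1].argmax(dim=-1, keepdim=True)
            proposal = torch.cat([proposal, nxt], dim=1)

        # 2) Target model scores every proposed position in ONE forward pass.
        logits = target(proposal)
        # Target's choices for the k drafted slots, plus one bonus token.
        verified = logits[:, -k - 1:].argmax(dim=-1)
        drafted = proposal[:, -k:]

        # 3) Accept the longest prefix where draft and target agree.
        n_accept = 0
        for i in range(k):
            if bool((verified[:, i] == drafted[:, i]).all()):
                n_accept += 1
            else:
                break

        # Append accepted tokens plus the target's token at the first
        # mismatch (or the bonus token if all k drafts were accepted).
        return torch.cat(
            [tokens, drafted[:, :n_accept], verified[:, n_accept:n_accept + 1]],
            dim=1,
        )

Each call emits between 1 and k + 1 tokens for roughly one target-model forward pass, which is where the latency win comes from.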

Maintenance & Community

The project is maintained under the pytorch-labs organization and credits contributions and inspiration from Lightning AI, GGML, Andrej Karpathy, and MLC-LLM. Community projects inspired by gpt-fast include gpt-blazing, gptfast, and gpt-accelera.

Licensing & Compatibility

Released under the BSD 3-Clause license, which permits commercial use and modification with attribution.

Limitations & Caveats

The project explicitly states it is not intended as a framework or library, encouraging direct code reuse. Generative tasks are not currently supported for evaluation via eval.py. Benchmarks are run with a batch size of 1 and short prompts, which may not reflect performance in all scenarios.

Health Check

  • Last commit: 3 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 3
  • Star history: 118 stars in the last 90 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Sualeh Asif (Cofounder of Cursor), and 1 more.

attorch by BobMcDear

Top 0.3% on sourcepulse · 564 stars
PyTorch nn module subset, implemented in Python using Triton
Created 2 years ago · updated 2 days ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Zhuohan Li (Author of vLLM), and 6 more.

torchtitan by pytorch

Top 0.9% on sourcepulse · 4k stars
PyTorch platform for generative AI model training research
Created 1 year ago · updated 19 hours ago
Starred by Nat Friedman (Former CEO of GitHub), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 6 more.

FasterTransformer by NVIDIA

Top 0.2% on sourcepulse · 6k stars
Optimized transformer library for inference
Created 4 years ago · updated 1 year ago
Starred by Tim J. Baek (Founder of Open WebUI), Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake), and 7 more.

pytorch-tutorial by yunjey

Top 0.1% on sourcepulse · 32k stars
PyTorch tutorial for deep learning researchers
Created 8 years ago · updated 1 year ago