gpt-fast by meta-pytorch

PyTorch text generation for efficient transformer inference

Created 1 year ago
6,094 stars

Top 8.5% on SourcePulse

Project Summary

This repository provides a highly efficient, PyTorch-native implementation for transformer text generation, targeting researchers and power users seeking maximum performance with minimal code. It achieves very low latency and high throughput for models like LLaMA and Mixtral using techniques such as int8/int4 quantization and speculative decoding, all within approximately 1000 lines of Python.

How It Works

The core approach leverages native PyTorch features and optimizations to deliver performance without external frameworks. Key techniques include int8 and int4 weight-only quantization for reduced memory footprint and faster computation, speculative decoding for improved generation speed by using a smaller draft model, and tensor parallelism for distributing model computations across multiple GPUs. This native PyTorch implementation aims for simplicity and direct control over the generation process.
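
To make the weight-only quantization idea concrete, here is a minimal pure-Python sketch of symmetric per-row int8 quantization. This is an illustration of the technique only, not gpt-fast's implementation, which operates on PyTorch tensors with fused dequantize-matmul kernels; the function names are hypothetical.

```python
# Illustrative sketch of symmetric per-row int8 weight-only quantization.
# Not gpt-fast's code: the real implementation works on PyTorch tensors.

def quantize_row(weights):
    """Quantize one weight row to int8 with a single per-row scale."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_row(q, scale):
    """Recover approximate floating-point weights at compute time."""
    return [v * scale for v in q]

row = [0.5, -1.27, 0.03, 1.0]
q, scale = quantize_row(row)
approx = dequantize_row(q, scale)
max_err = max(abs(a - b) for a, b in zip(row, approx))
assert max_err <= scale / 2 + 1e-9  # error bounded by half a quantization step
```

Storing weights as int8 (or packed int4) cuts memory traffic, which matters because small-batch decoding is memory-bandwidth bound; the scale per row (or per group, for int4) bounds the reconstruction error.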

Quick Start & Requirements

  • Install: pip install -r requirements.txt (after installing PyTorch nightly).
  • Prerequisites: PyTorch nightly, sentencepiece. Supports NVIDIA and AMD GPUs.
  • Model Conversion: Use ./scripts/prepare.sh <MODEL_REPO> to convert Hugging Face models.
  • Example Generation: python generate.py --compile --checkpoint_path checkpoints/$MODEL_REPO/model.pth --prompt "Hello, my name is"
  • Documentation: Blog post walkthrough available.

Highlighted Details

  • Achieves high tokens/second on LLaMA and Mixtral models, with significant speedups via 8-bit and 4-bit quantization.
  • Demonstrates strong performance with tensor parallelism, scaling up to 8 GPUs.
  • Supports speculative decoding with a draft model for further latency reduction.
  • Includes a pure PyTorch implementation of GPTQ quantization.
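
The speculative decoding flow can be sketched as follows. This is a simplified greedy variant with toy stand-in "models" (the real method, as used in gpt-fast, verifies draft tokens probabilistically against the target model's distribution and runs real networks); all names here are hypothetical.

```python
# Toy sketch of greedy speculative decoding. The draft model proposes k
# tokens cheaply; the target model verifies them, accepting matches and
# correcting the first mismatch. The "models" are stand-in functions.

def draft_model(ctx):
    # Hypothetical fast draft model: repeats the last token.
    return ctx[-1]

def target_model(ctx):
    # Hypothetical slow target model: running sum of tokens modulo 10.
    return sum(ctx) % 10

def speculative_step(ctx, k=4):
    """Draft k tokens, then verify each against the target model."""
    draft, d_ctx = [], list(ctx)
    for _ in range(k):
        t = draft_model(d_ctx)
        draft.append(t)
        d_ctx.append(t)

    accepted, v_ctx = [], list(ctx)
    for t in draft:
        expected = target_model(v_ctx)   # in practice: one batched forward
        if t == expected:
            accepted.append(t)
            v_ctx.append(t)
        else:
            accepted.append(expected)    # fall back to the target's token
            break
    return accepted

print(speculative_step([0, 0], k=4))  # → [0, 0, 0, 0]  (all drafts accepted)
print(speculative_step([3, 3], k=4))  # → [6]  (draft rejected immediately)
```

When drafts are frequently accepted, k tokens cost roughly one target-model pass instead of k sequential passes, which is where the latency reduction comes from.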

Maintenance & Community

The project is maintained under the meta-pytorch GitHub organization (formerly pytorch-labs) and acknowledges contributions and inspiration from Lightning AI, GGML, Karpathy, and MLC-LLM. Community projects inspired by gpt-fast include gpt-blazing, gptfast, and gpt-accelera.

Licensing & Compatibility

Released under the BSD 3-Clause license, which permits commercial use and modification with attribution.

Limitations & Caveats

The project explicitly states it is not intended as a framework or library, encouraging direct code reuse. Generative tasks are not currently supported for evaluation via eval.py. Benchmarks are run with a batch size of 1 and short prompts, which may not reflect performance in all scenarios.

Health Check

  • Last Commit: 3 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 2
  • Issues (30d): 0
  • Star History: 52 stars in the last 30 days

Explore Similar Projects

Starred by Luca Soldaini (Research Scientist at Ai2), Edward Sun (Research Scientist at Meta Superintelligence Lab), and 4 more.

parallelformers by tunib-ai

0%
790
Toolkit for easy model parallelization
Created 4 years ago
Updated 2 years ago
Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), and 8 more.

EAGLE by SafeAILab

10.6%
2k
Speculative decoding research paper for faster LLM inference
Created 1 year ago
Updated 1 week ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Jeff Hammerbacher (Cofounder of Cloudera), and 4 more.

gemma_pytorch by google

0.2%
6k
PyTorch implementation for Google's Gemma models
Created 1 year ago
Updated 3 months ago
Starred by Nat Friedman (Former CEO of GitHub), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 15 more.

FasterTransformer by NVIDIA

0.1%
6k
Optimized transformer library for inference
Created 4 years ago
Updated 1 year ago
Starred by Alex Yu (Research Scientist at OpenAI; Former Cofounder of Luma AI), Lianmin Zheng (Coauthor of SGLang, vLLM), and 2 more.

HunyuanVideo by Tencent-Hunyuan

0.2%
11k
PyTorch code for video generation research
Created 9 months ago
Updated 3 weeks ago