PyTorch text generation for efficient transformer inference
This repository provides a highly efficient, PyTorch-native implementation for transformer text generation, targeting researchers and power users seeking maximum performance with minimal code. It achieves very low latency and high throughput for models like LLaMA and Mixtral using techniques such as int8/int4 quantization and speculative decoding, all within approximately 1000 lines of Python.
How It Works
The core approach leverages native PyTorch features and optimizations to deliver performance without external frameworks. Key techniques include int8 and int4 weight-only quantization for reduced memory footprint and faster computation, speculative decoding for improved generation speed by using a smaller draft model, and tensor parallelism for distributing model computations across multiple GPUs. This native PyTorch implementation aims for simplicity and direct control over the generation process.
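As an illustration of the weight-only quantization idea, here is a minimal per-output-channel int8 sketch in plain PyTorch. The names (quantize_int8_per_channel, Int8Linear) are hypothetical and this is not the repository's actual code; gpt-fast's own implementation is more elaborate and relies on torch.compile for speed.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def quantize_int8_per_channel(weight: torch.Tensor):
    """Quantize an [out_features, in_features] weight matrix to int8,
    keeping one floating-point scale per output channel."""
    scale = weight.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(weight / scale), -128, 127).to(torch.int8)
    return q, scale.squeeze(1)

class Int8Linear(nn.Module):
    """Hypothetical linear layer that stores int8 weights (4x smaller than fp32)
    and dequantizes on the fly in the forward pass."""
    def __init__(self, linear: nn.Linear):
        super().__init__()
        q, scale = quantize_int8_per_channel(linear.weight.detach())
        self.register_buffer("weight_int8", q)
        self.register_buffer("scale", scale)
        self.bias = linear.bias

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Dequantize into the activation dtype; scale broadcasts per output channel.
        out = F.linear(x, self.weight_int8.to(x.dtype)) * self.scale.to(x.dtype)
        return out if self.bias is None else out + self.bias

# Sanity check: the quantized layer should closely track the original fp32 layer.
layer = nn.Linear(4096, 4096)
x = torch.randn(2, 4096)
print(torch.allclose(layer(x), Int8Linear(layer)(x), atol=1e-1))
```

The memory savings come from storing the int8 buffer; optimized kernels avoid materializing the dequantized weight matrix, whereas this sketch only demonstrates the numerics.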
Quick Start & Requirements
Install a recent PyTorch nightly build first, then install the remaining dependencies (including sentencepiece):
pip install -r requirements.txt
Both NVIDIA and AMD GPUs are supported. Convert Hugging Face models with:
./scripts/prepare.sh <MODEL_REPO>
Then generate text:
python generate.py --compile --checkpoint_path checkpoints/$MODEL_REPO/model.pth --prompt "Hello, my name is"
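For quantized generation, the upstream repository also ships a quantization script. The commands below are a sketch based on the upstream README; flag names such as --mode may differ between versions, so verify them against the repository.

```
python quantize.py --checkpoint_path checkpoints/$MODEL_REPO/model.pth --mode int8
python generate.py --compile --checkpoint_path checkpoints/$MODEL_REPO/model_int8.pth --prompt "Hello, my name is"
```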
Maintenance & Community
The project is associated with pytorch-labs and acknowledges contributions and inspiration from Lightning AI, GGML, Karpathy, and MLC-LLM. Community projects inspired by gpt-fast include gpt-blazing, gptfast, and gpt-accelera.
Licensing & Compatibility
Released under the BSD 3-Clause license, which permits commercial use and modification with attribution.
Limitations & Caveats
The project explicitly states it is not intended as a framework or library, encouraging direct code reuse. Generative tasks are not currently supported for evaluation via eval.py. Benchmarks are run with a batch size of 1 and short prompts, which may not reflect performance in all scenarios.