GPTFast by MDK8888

HF Transformers accelerator for faster inference

Created 2 years ago · 684 stars · Top 49.4% on SourcePulse

View on GitHub
Project Summary

GPTFast accelerates Hugging Face Transformers models, targeting researchers and engineers who need faster inference. It generalizes optimizations originally developed for Llama-2-7b to all Hugging Face models, delivering substantial speedups through techniques such as static key-value caching, quantization, and speculative decoding.

How It Works

GPTFast integrates its optimizations by modifying model forward passes and attention mechanisms. A cache_config dictionary tells it how to inject a static key-value cache into a given model, naming the relevant model components and describing their forward-pass logic. For speculative decoding, a smaller "draft" model generates candidate tokens that the main model then verifies, reducing the number of expensive forward passes through the large model.
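GPTFast's actual cache_config schema is documented in its README; the snippet below is only a minimal sketch of the underlying static-cache idea in plain PyTorch, not GPTFast's API. The key and value buffers are preallocated at a fixed maximum length, so their shapes stay constant across decoding steps instead of growing token by token:

    import torch

    # Minimal sketch of a static key-value cache (illustrative only,
    # not GPTFast's API): buffers are preallocated at max_len so their
    # shapes never change while decoding.
    class StaticKVCache:
        def __init__(self, batch, heads, max_len, head_dim,
                     dtype=torch.float16, device="cuda"):
            shape = (batch, heads, max_len, head_dim)
            self.k = torch.zeros(shape, dtype=dtype, device=device)
            self.v = torch.zeros(shape, dtype=dtype, device=device)

        def update(self, pos, k_new, v_new):
            # Write this step's new keys/values in place at position `pos`.
            self.k[:, :, pos] = k_new
            self.v[:, :, pos] = v_new
            return self.k, self.v

Because the cache never changes shape, torch.compile can specialize its kernels once instead of recompiling as the sequence grows.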

Quick Start & Requirements

  • Install: pip install gptfast
  • Requirements: Python >= 3.10, CUDA-enabled device.
  • Example usage and detailed documentation for gpt_fast, load_int8, add_kv_cache, and add_speculative_decoding are available in the README; a hypothetical invocation is sketched below.
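The following quick-start sketch is hypothetical: the import path and keyword arguments are assumptions made for illustration, not the verified API, so defer to the README's documented examples.

    # Hypothetical sketch; the import path and keywords are assumptions.
    # See the GPTFast README for the real gpt_fast signature.
    from GPTFast.Core import gpt_fast  # assumed import path

    model = gpt_fast(
        "gpt2",                         # any Hugging Face causal LM id
        draft_model_name="distilgpt2",  # assumed: smaller draft model
    )                                   # for speculative decoding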

Highlighted Details

  • Achieves up to 9x inference acceleration with GPTQ int4 quantization and optimized kernels (v0.3.x).
  • Static key-value cache integration offers up to 8.5x speedup (v0.2.x).
  • Initial release (v0.1.x) provided 7x acceleration via torch.compile, int8 quantization, and speculative decoding (the general speculative-decoding step is sketched after this list).
  • Roadmap includes support for Medusa, Speculative Sampling, Eagle, various quantization methods (BitNet, AWQ, QoQ, GGUF, HQQ), and vLLM/FlashAttention integration.
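For intuition, the speculative-decoding step behind the v0.1.x speedup can be sketched in plain PyTorch. This is the general greedy technique, not GPTFast's implementation; it assumes batch size 1 and Hugging Face-style causal LMs whose forward pass returns an object with a .logits tensor:

    import torch

    @torch.no_grad()
    def speculative_step(main_model, draft_model, input_ids, k=4):
        # One greedy speculative-decoding step (general technique, not
        # GPTFast's code). input_ids: (1, seq_len) token ids.
        prompt_len = input_ids.shape[1]

        # 1. The small draft model proposes k candidate tokens, one at a time.
        draft_ids = input_ids
        for _ in range(k):
            next_tok = draft_model(draft_ids).logits[:, -1].argmax(-1, keepdim=True)
            draft_ids = torch.cat([draft_ids, next_tok], dim=-1)
        candidates = draft_ids[:, prompt_len:]                    # (1, k)

        # 2. The main model scores the whole extended sequence in one pass.
        logits = main_model(draft_ids).logits
        # The prediction for each candidate comes from the position before it.
        preds = logits[:, prompt_len - 1:-1].argmax(-1)           # (1, k)

        # 3. Keep the longest prefix on which both models agree.
        n_accept = int((preds == candidates).long().cumprod(-1).sum())
        accepted = candidates[:, :n_accept]

        # 4. The main model's next token after that prefix comes for free.
        bonus = logits[:, prompt_len - 1 + n_accept].argmax(-1, keepdim=True)
        return torch.cat([input_ids, accepted, bonus], dim=-1)

Each call costs one main-model forward pass but can emit up to k + 1 tokens, which is where the speedup comes from whenever the draft model agrees often with the main model.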

Maintenance & Community

  • The most recent release line (v0.3.x) shipped in June 2024; per the health check below, the repository has since been inactive.
  • The roadmap lists further planned feature development.

Licensing & Compatibility

  • The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The provided documentation is marked as deprecated, with new documentation pending. Customizing cache_config requires detailed knowledge of the target model's internal structure.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 0 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems") and Yineng Zhang (inference lead at SGLang; research scientist at Together AI).

rtp-llm by alibaba

LLM inference engine for diverse applications
1k stars · top 0.3% on SourcePulse · created 2 years ago · updated 20 hours ago
Starred by Nat Friedman (former CEO of GitHub), Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), and 15 more.

FasterTransformer by NVIDIA

Optimized transformer library for inference
6k stars · top 0.0% on SourcePulse · created 5 years ago · updated 2 years ago