GPTFast by MDK8888

HF Transformers accelerator for faster inference

Created 2 years ago · 684 stars · Top 49.4% on SourcePulse

View on GitHub
Project Summary

GPTFast accelerates Hugging Face Transformers models, targeting researchers and engineers who need faster inference. It generalizes optimizations originally developed for Llama-2-7b to all Hugging Face models, delivering substantial speedups through techniques such as static key-value caching, quantization, and speculative decoding.

How It Works

GPTFast integrates its optimizations by modifying model forward passes and attention mechanisms. A cache_config dictionary tells it how to inject a static key-value cache into a given model, naming the relevant model components and describing their forward-pass logic. For speculative decoding, a smaller "draft" model generates candidate tokens that the main model then verifies, reducing the number of expensive forward passes through the large model.
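GPTFast's actual cache_config schema is documented in its README; the snippet below is only a minimal sketch of the underlying static-cache idea in plain PyTorch, not GPTFast's API. The key and value buffers are preallocated at a fixed maximum length, so their shapes stay constant across decoding steps instead of growing token by token:

    import torch

    # Minimal sketch of a static key-value cache (illustrative only,
    # not GPTFast's API): buffers are preallocated at max_len so their
    # shapes never change while decoding.
    class StaticKVCache:
        def __init__(self, batch, heads, max_len, head_dim,
                     dtype=torch.float16, device="cuda"):
            shape = (batch, heads, max_len, head_dim)
            self.k = torch.zeros(shape, dtype=dtype, device=device)
            self.v = torch.zeros(shape, dtype=dtype, device=device)

        def update(self, pos, k_new, v_new):
            # Write this step's new keys/values in place at position `pos`.
            self.k[:, :, pos] = k_new
            self.v[:, :, pos] = v_new
            return self.k, self.v

Because the cache never changes shape, torch.compile can specialize its kernels once instead of recompiling as the sequence grows.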

Quick Start & Requirements

  • Install: pip install gptfast
  • Requirements: Python >= 3.10, CUDA-enabled device.
  • Example usage and detailed documentation for gpt_fast, load_int8, add_kv_cache, and add_speculative_decoding are available in the README; a hypothetical invocation is sketched below.
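The following quick-start sketch is hypothetical: the import path and keyword arguments are assumptions made for illustration, not the verified API, so defer to the README's documented examples.

    # Hypothetical sketch; the import path and keywords are assumptions.
    # See the GPTFast README for the real gpt_fast signature.
    from GPTFast.Core import gpt_fast  # assumed import path

    model = gpt_fast(
        "gpt2",                         # any Hugging Face causal LM id
        draft_model_name="distilgpt2",  # assumed: smaller draft model
    )                                   # for speculative decoding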

Highlighted Details

  • Achieves up to 9x inference acceleration with GPTQ int4 quantization and optimized kernels (v0.3.x).
  • Static key-value cache integration offers up to 8.5x speedup (v0.2.x).
  • Initial release (v0.1.x) provided 7x acceleration via torch.compile, int8 quantization, and speculative decoding (the general speculative-decoding step is sketched after this list).
  • Roadmap includes support for Medusa, Speculative Sampling, Eagle, various quantization methods (BitNet, AWQ, QoQ, GGUF, HQQ), and vLLM/FlashAttention integration.
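For intuition, the speculative-decoding step behind the v0.1.x speedup can be sketched in plain PyTorch. This is the general greedy technique, not GPTFast's implementation; it assumes batch size 1 and Hugging Face-style causal LMs whose forward pass returns an object with a .logits tensor:

    import torch

    @torch.no_grad()
    def speculative_step(main_model, draft_model, input_ids, k=4):
        # One greedy speculative-decoding step (general technique, not
        # GPTFast's code). input_ids: (1, seq_len) token ids.
        prompt_len = input_ids.shape[1]

        # 1. The small draft model proposes k candidate tokens, one at a time.
        draft_ids = input_ids
        for _ in range(k):
            next_tok = draft_model(draft_ids).logits[:, -1].argmax(-1, keepdim=True)
            draft_ids = torch.cat([draft_ids, next_tok], dim=-1)
        candidates = draft_ids[:, prompt_len:]                    # (1, k)

        # 2. The main model scores the whole extended sequence in one pass.
        logits = main_model(draft_ids).logits
        # The prediction for each candidate comes from the position before it.
        preds = logits[:, prompt_len - 1:-1].argmax(-1)           # (1, k)

        # 3. Keep the longest prefix on which both models agree.
        n_accept = int((preds == candidates).long().cumprod(-1).sum())
        accepted = candidates[:, :n_accept]

        # 4. The main model's next token after that prefix comes for free.
        bonus = logits[:, prompt_len - 1 + n_accept].argmax(-1, keepdim=True)
        return torch.cat([input_ids, accepted, bonus], dim=-1)

Each call costs one main-model forward pass but can emit up to k + 1 tokens, which is where the speedup comes from whenever the draft model agrees often with the main model.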

Maintenance & Community

  • The most recent release line (v0.3.x) shipped in June 2024; per the health check below, the repository has since been inactive.
  • The roadmap lists further planned feature development.

Licensing & Compatibility

  • The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The provided documentation is marked as deprecated, with new documentation pending. Customizing cache_config requires detailed knowledge of the target model's internal structure.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 0 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems") and Yineng Zhang (inference lead at SGLang; research scientist at Together AI).

rtp-llm by alibaba

LLM inference engine for diverse applications
1k stars · top 0.3% on SourcePulse · created 2 years ago · updated 20 hours ago
Starred by Nat Friedman (former CEO of GitHub), Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), and 15 more.

FasterTransformer by NVIDIA

Optimized transformer library for inference
6k stars · top 0.0% on SourcePulse · created 5 years ago · updated 2 years ago