GPTFast by MDK8888

HF Transformers accelerator for faster inference

created 1 year ago
685 stars

Top 50.5% on sourcepulse

View on GitHub
Project Summary

GPTFast provides techniques to accelerate Hugging Face Transformers models, targeting researchers and engineers seeking faster inference. It generalizes optimizations originally developed for Llama-2-7b to all Hugging Face models, offering significant speedups through methods like static key-value caching and speculative decoding.

How It Works

GPTFast integrates its optimizations by modifying model forward passes and attention mechanisms. A cache_config dictionary specifies how to inject static key-value caches, referencing the target model's components and their forward-pass logic. For speculative decoding, a smaller "draft" model generates candidate tokens cheaply; the main model then verifies all candidates in a single forward pass, so several tokens can be emitted per expensive call to the large model.
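
To make the verification step concrete, below is a minimal sketch of greedy speculative decoding in plain PyTorch. It illustrates the general technique rather than GPTFast's implementation: draft_model and target_model are assumed to be callables returning logits of shape (batch, seq_len, vocab), and batch size 1 is assumed for simplicity.

```python
import torch

def speculative_decode_step(draft_model, target_model, input_ids, k=4):
    """One round of greedy speculative decoding (illustrative sketch).

    The cheap draft model proposes k tokens autoregressively; the large
    target model scores the whole proposed sequence in a single forward
    pass and keeps the longest prefix it agrees with.
    """
    # 1. Draft model proposes k candidate tokens, one at a time (cheap).
    draft_ids = input_ids
    for _ in range(k):
        logits = draft_model(draft_ids)
        next_tok = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        draft_ids = torch.cat([draft_ids, next_tok], dim=-1)

    # 2. Target model verifies all k candidates in ONE forward pass.
    prompt_len = input_ids.shape[1]
    target_logits = target_model(draft_ids)
    # Target's greedy prediction for each proposed position.
    target_preds = target_logits[:, prompt_len - 1 : -1, :].argmax(dim=-1)
    proposed = draft_ids[:, prompt_len:]

    # 3. Accept the longest prefix where draft and target agree
    #    (cumprod zeroes out everything after the first mismatch).
    agree = (target_preds == proposed).long().cumprod(dim=-1)
    n_accept = int(agree.sum().item())  # batch size 1 assumed

    # The target's own prediction supplies one "free" extra token, so
    # every round emits at least one token even when n_accept == 0.
    accepted = proposed[:, :n_accept]
    bonus = target_logits[:, prompt_len - 1 + n_accept, :].argmax(
        dim=-1, keepdim=True
    )
    return torch.cat([input_ids, accepted, bonus], dim=-1)
```

A full generation loop calls such a step repeatedly until a length or EOS condition is met; the speedup comes from the target model running once per round instead of once per token.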

Quick Start & Requirements

  • Install: pip install gptfast
  • Requirements: Python >= 3.10, CUDA-enabled device.
  • Example usage and detailed documentation for gpt_fast, load_int8, add_kv_cache, and add_speculative_decoding are available in the README; an illustrative sketch of the overall flow follows this list.
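
For orientation, here is a usage sketch under stated assumptions: gpt_fast is the entry point named in the README, but the parameter names (sample_function, draft_model_name, cache_config) and the generate call below are illustrative guesses at the API's shape, not its documented signature; defer to the README examples.

```python
# Illustrative sketch only: `gpt_fast` is documented in the README, but the
# parameters and `generate` signature below are assumptions.
import torch
from transformers import AutoTokenizer
from GPTFast.Core import gpt_fast  # import path assumed from the README

def argmax_sample(logits: torch.Tensor) -> torch.Tensor:
    # Greedy decoding: take the most probable token.
    return logits.argmax(dim=-1, keepdim=True)

model_name = "gpt2-xl"   # main model (hypothetical choice)
draft_name = "gpt2"      # smaller draft model for speculative decoding

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = gpt_fast(
    model_name,
    sample_function=argmax_sample,  # assumed parameter
    draft_model_name=draft_name,    # assumed parameter
    # cache_config=...  -- model-specific; see the README for a full example
)

ids = tokenizer("The meaning of life is", return_tensors="pt").input_ids
out = model.generate(ids, max_tokens=50)  # assumed signature
print(tokenizer.decode(out[0]))
```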

Highlighted Details

  • Achieves up to 9x inference acceleration with GPTQ int4 quantization and optimized kernels (v0.3.x).
  • Static key-value cache integration offers up to 8.5x speedup (v0.2.x).
  • Initial release (v0.1.x) provided 7x acceleration via torch.compile, int8 quantization, and speculative decoding (a minimal int8 quantization sketch follows this list).
  • Roadmap includes support for Medusa, Speculative Sampling, Eagle, various quantization methods (BitNet, AWQ, QoQ, GGUF, HQQ), and vLLM/FlashAttention integration.
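
To make the int8 item concrete, here is a minimal, self-contained sketch of weight-only int8 quantization with per-channel scales. It shows the general technique behind functions like load_int8, not GPTFast's actual code.

```python
import torch

def quantize_int8(weight: torch.Tensor):
    """Symmetric per-output-channel int8 quantization of a weight matrix."""
    # One scale per output row, so the row's max magnitude maps to 127.
    scale = (weight.abs().amax(dim=1, keepdim=True) / 127.0).clamp(min=1e-8)
    w_int8 = torch.clamp(torch.round(weight / scale), -128, 127).to(torch.int8)
    return w_int8, scale

def int8_linear(x: torch.Tensor, w_int8: torch.Tensor, scale: torch.Tensor):
    # Dequantize on the fly: weights sit in memory at 1 byte each, and the
    # reduced memory traffic is where the inference speedup comes from.
    return x @ (w_int8.to(x.dtype) * scale).t()

# Round-trip check on a random layer: the error should be small.
w = torch.randn(256, 512)
w_q, s = quantize_int8(w)
x = torch.randn(4, 512)
print((int8_linear(x, w_q, s) - x @ w.t()).abs().max())
```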

Maintenance & Community

  • Development was active through mid-2024, with the most recent releases (v0.3.x) shipped in June 2024.
  • Roadmap indicates ongoing feature development.

Licensing & Compatibility

  • The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The provided documentation is marked as deprecated, with new documentation pending. The cache_config requires detailed knowledge of the target model's internal structure for customization.
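
To give a sense of what that customization involves, a cache_config might look like the sketch below. Every key shown is a hypothetical placeholder, not GPTFast's actual schema; the point is that the dictionary must name the model's internal modules and describe how their forward passes consume the injected static cache.

```python
# Hypothetical illustration only -- not GPTFast's real cache_config schema.
# Consult the README for the documented structure.
cache_config = {
    # Where the transformer blocks live inside the HF model object,
    # e.g. model.transformer.h for GPT-2-style models (placeholder key).
    "path_to_blocks": ["transformer", "h"],
    # Which submodule of each block implements attention (placeholder key).
    "attention_module": "attn",
    # Size of the preallocated static key-value cache (placeholder key).
    "max_cache_length": 1024,
}
```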

Health Check

  • Last commit: 11 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 5 stars in the last 90 days

Starred by Tobi Lutke (Cofounder of Shopify), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 7 more.

Explore Similar Projects

ctransformers by marella

Python bindings for fast Transformer model inference
created 2 years ago, updated 1 year ago
2k stars
Top 0.1% on sourcepulse
Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Jeff Hammerbacher (Cofounder of Cloudera), and 2 more.

gemma_pytorch by google

PyTorch implementation for Google's Gemma models
created 1 year ago, updated 2 months ago
6k stars
Top 0.1% on sourcepulse