TriForce by Infini-AI-Lab

Framework for lossless acceleration of long sequence generation

Created 1 year ago
263 stars

Top 96.9% on SourcePulse

View on GitHub
Project Summary

TriForce accelerates long sequence generation for large language models through hierarchical speculative decoding, offering a training-free approach to improve efficiency. It targets researchers and practitioners working with long-context models who need to reduce latency and computational cost.

How It Works

TriForce employs hierarchical speculative decoding. A small, fast draft model generates multiple candidate tokens, which are first checked by an intermediate stage, the target model itself running on a retrieval-based partial KV cache, before the target model with the full KV cache performs the final verification; this addresses both the model-weight and KV-cache bottlenecks without any retraining. The framework also supports offloading the KV cache to CPU memory, managed with tensor parallelism and CUDA Graphs for optimized performance across different hardware configurations.
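A minimal sketch of that two-level speculation loop, assuming greedy decoding and treating each stage as an abstract next-token callable (the names and parameters here are illustrative, not TriForce's actual API); real implementations verify drafted tokens in one batched forward pass rather than token by token:

```python
from typing import Callable, List

# A "model" here is just a greedy next-token function over a token prefix.
Model = Callable[[List[int]], int]

def draft(model: Model, prefix: List[int], k: int) -> List[int]:
    """Autoregressively draft k candidate tokens with a cheap model."""
    ctx, out = list(prefix), []
    for _ in range(k):
        tok = model(ctx)
        out.append(tok)
        ctx.append(tok)
    return out

def verify(model: Model, prefix: List[int], candidates: List[int]) -> List[int]:
    """Accept the longest prefix of `candidates` the verifier agrees with;
    on the first mismatch, keep the verifier's own token and stop."""
    ctx, out = list(prefix), []
    for tok in candidates:
        v = model(ctx)
        out.append(v)
        if v != tok:
            break
        ctx.append(v)
    return out

def hierarchical_step(small: Model, partial_cache_target: Model, full_cache_target: Model,
                      prefix: List[int], k_small: int = 4, k_mid: int = 16) -> List[int]:
    """One hierarchical step: the small model drafts for the target running on a
    partial (retrieval-based) KV cache, and the tokens accepted there are then
    verified against the target with the full KV cache."""
    ctx, mid_accepted = list(prefix), []
    while len(mid_accepted) < k_mid:
        candidates = draft(small, ctx, k_small)
        accepted = verify(partial_cache_target, ctx, candidates)
        mid_accepted.extend(accepted)
        ctx.extend(accepted)
    return verify(full_cache_target, prefix, mid_accepted)
```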

Quick Start & Requirements

  • Install: conda create -n TriForce python=3.9, conda activate TriForce, pip install -r requirements.txt, pip install flash-attn --no-build-isolation.
  • Prerequisites: Python 3.9, CUDA 12.1 (for torch==2.2.1+cu121), flash_attn==2.5.7, transformers==4.37.2. Supports long-context Llama models (e.g., Llama2-7B-128K, LWM-Text-128K); a hypothetical loading sketch follows this list.
  • Resources: Requires significant GPU memory for on-chip operations and benefits from high PCIe bandwidth for offloading.
  • Docs: Paper, Blog
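As a hypothetical starting point (the checkpoint ID and device handling below are assumptions, not taken from the README), a long-context Llama model can be loaded with the pinned transformers version roughly like this:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder long-context checkpoint; substitute the Llama2-7B-128K or
# LWM-Text-128K weights you intend to evaluate with TriForce.
model_id = "NousResearch/Yarn-Llama-2-7b-128k"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
model.eval()
```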

Highlighted Details

  • Achieves 2.2x speedup on a single A100 for 128K context length.
  • Supports offloading KV cache to CPU with tensor parallelism and CUDA Graph optimization (see the sketch after this list).
  • Demonstrates performance on 2x RTX 4090s for offloading scenarios.
  • Includes baseline implementations for performance comparison.
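A toy illustration of the offloading pattern (assumed shapes and names, not TriForce's implementation): the KV cache lives in pinned CPU memory and only the chunk needed next is copied to the GPU on a separate CUDA stream, so the transfer can overlap with compute:

```python
import torch

# Toy dimensions; a real 7B model at 128K context is far larger (tens of GB).
layers, heads, head_dim, max_len, chunk = 4, 8, 64, 16384, 1024

# Full cache in pinned (page-locked) host memory for fast async H2D copies.
kv_cpu = torch.empty(layers, 2, max_len, heads, head_dim,
                     dtype=torch.float16, pin_memory=True)
# On-GPU buffer that holds only the currently needed chunk.
kv_gpu = torch.empty(layers, 2, chunk, heads, head_dim,
                     dtype=torch.float16, device="cuda")

copy_stream = torch.cuda.Stream()

def fetch_chunk(start: int) -> None:
    """Asynchronously copy one KV chunk from CPU to GPU on a dedicated stream."""
    with torch.cuda.stream(copy_stream):
        kv_gpu.copy_(kv_cpu[:, :, start:start + chunk], non_blocking=True)

fetch_chunk(0)
# Ensure the compute stream sees the finished copy before attention uses kv_gpu.
torch.cuda.current_stream().wait_stream(copy_stream)
```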

Maintenance & Community

  • Authors affiliated with Carnegie Mellon University and Meta AI (FAIR).
  • Paper published on arXiv.
  • Issue #7 provides environment setup guidance.

Licensing & Compatibility

  • The README does not explicitly state a license.

Limitations & Caveats

  • Compatibility issues noted with newer transformers versions; transformers==4.37.2 is required.
  • CUDA Graph support for tensor parallelism is limited to A100s, not RTX 4090s.
  • Offloading performance is sensitive to PCIe bandwidth.
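A back-of-the-envelope illustration of that sensitivity (the bandwidth figure is an assumption, not a measurement from the repo): the full fp16 KV cache of Llama2-7B at 128K context is roughly 69 GB, so even one full transfer over PCIe takes seconds, which is why only a small retrieved budget of the cache should cross the bus per step:

```python
# Llama2-7B attention shape: 32 layers, 32 KV heads, head_dim 128, fp16 (2 bytes).
layers, kv_heads, head_dim, seq_len, bytes_per_elem = 32, 32, 128, 128 * 1024, 2
kv_bytes = 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem  # K and V

pcie_bytes_per_s = 25e9  # assumed effective PCIe 4.0 x16 bandwidth
print(f"KV cache ~ {kv_bytes / 1e9:.1f} GB, full transfer ~ {kv_bytes / pcie_bytes_per_s:.1f} s")
# -> roughly 68.7 GB and ~2.7 s per full pass over the bus.
```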

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 3 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Jeremy Howard (Cofounder of fast.ai).

GPTFast by MDK8888

687 stars
HF Transformers accelerator for faster inference
Created 1 year ago
Updated 1 year ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Ying Sheng (Coauthor of SGLang), and 2 more.

LookaheadDecoding by hao-ai-lab

1k stars
Parallel decoding algorithm for faster LLM inference
Created 1 year ago
Updated 6 months ago
Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), and 8 more.

EAGLE by SafeAILab

2k stars
Speculative decoding research paper for faster LLM inference
Created 1 year ago
Updated 1 week ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Lei Zhang (Director Engineering AI at AMD), and 23 more.

gpt-fast by meta-pytorch

6k stars
PyTorch text generation for efficient transformer inference
Created 1 year ago
Updated 3 weeks ago