TriForce by Infini-AI-Lab

Framework for lossless acceleration of long sequence generation

created 1 year ago
261 stars

Top 98.0% on sourcepulse

Project Summary

TriForce accelerates long sequence generation for large language models through hierarchical speculative decoding, offering a training-free approach to improve efficiency. It targets researchers and practitioners working with long-context models who need to reduce latency and computational cost.

How It Works

TriForce employs a hierarchical speculative decoding strategy: a small, fast "draft" model proposes candidate tokens, which are then verified by a larger, more capable "target" model. The hierarchy lets each level cheaply verify the one below it, so the expensive full-context target model runs far less often per generated token. The framework also supports offloading the KV cache to CPU memory, combined with tensor parallelism and CUDA graphs for optimized performance across different hardware configurations.
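As a rough illustration of the draft-and-verify loop at the heart of speculative decoding (this is a toy greedy sketch, not TriForce's actual API; `draft_next` and `target_next` are hypothetical stand-ins for the draft and target models):

```python
def speculative_generate(draft_next, target_next, prefix, num_tokens, gamma=4):
    """Toy greedy speculative decoding loop (illustrative only).

    draft_next / target_next map a token list to the next token id.
    The draft proposes up to `gamma` tokens; the target verifies them,
    keeping the longest matching prefix and substituting its own token
    at the first mismatch, so the output matches target-only decoding.
    """
    out = list(prefix)
    while len(out) - len(prefix) < num_tokens:
        # Draft phase: cheaply propose up to gamma candidate tokens.
        drafted, ctx = [], list(out)
        for _ in range(gamma):
            drafted.append(draft_next(ctx))
            ctx.append(drafted[-1])
        # Verify phase: the target checks each candidate in order.
        for t in drafted:
            if len(out) - len(prefix) >= num_tokens:
                break
            expected = target_next(out)
            out.append(expected)  # always emit the target's own token
            if expected != t:
                break             # reject the remaining draft tokens
    return out[len(prefix):]
```

Because every emitted token is the target's own choice, the output is identical to decoding with the target alone; the draft only determines how many target calls can be batched into one verification pass, which is where the "lossless" speedup comes from.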

Quick Start & Requirements

  • Install: conda create -n TriForce python=3.9, conda activate TriForce, pip install -r requirements.txt, pip install flash-attn --no-build-isolation.
  • Prerequisites: Python 3.9, CUDA 12.1 (for torch==2.2.1+cu121), flash_attn==2.5.7, transformers==4.37.2. Supports long-context Llama models (e.g., Llama2-7B-128K, LWM-Text-128K).
  • Resources: Requires significant GPU memory for on-chip operations and benefits from high PCIe bandwidth for offloading.
  • Docs: Paper, Blog

Highlighted Details

  • Achieves 2.2x speedup on a single A100 for 128K context length.
  • Supports offloading KV cache to CPU with tensor parallelism and CUDA Graph optimization.
  • Demonstrates performance on 2x RTX 4090s for offloading scenarios.
  • Includes baseline implementations for performance comparison.

Maintenance & Community

  • Authors affiliated with Carnegie Mellon University and Meta AI (FAIR).
  • Paper published on arXiv.
  • Issue #7 provides environment setup guidance.

Licensing & Compatibility

  • The README does not explicitly state a license.

Limitations & Caveats

  • Compatibility issues noted with newer transformers versions; transformers==4.37.2 is required.
  • CUDA Graph support for tensor parallelism is limited to A100s, not RTX 4090s.
  • Offloading performance is sensitive to PCIe bandwidth.

Health Check

  • Last commit: 11 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 14 stars in the last 90 days
