DejaVu by FMInference

Research paper on efficient LLM inference via contextual sparsity

Created 2 years ago · 331 stars · Top 83.8% on sourcepulse

Project Summary

DejaVu addresses the high computational cost of Large Language Model (LLM) inference by introducing contextual sparsity, enabling significant speedups without compromising model quality or in-context learning. It targets researchers and engineers working with LLMs who need to reduce inference latency on modern hardware.

How It Works

DejaVu builds on the observation that, for a given input, only a small subset of attention heads and MLP parameters is needed for accurate inference. A low-cost, on-the-fly predictor identifies these "contextual" sparse components at each layer, so the rest of the layer's computation can be skipped. Because nothing is pruned permanently, the approach avoids costly retraining, preserves LLM capabilities, and still delivers wall-clock speedups.
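As a rough illustration, here is a minimal PyTorch sketch of contextual sparsity for one MLP block. The predictor architecture, names, and shapes are assumptions for exposition, not the repository's actual implementation:

    import torch
    import torch.nn as nn

    class MLPSparsityPredictor(nn.Module):
        # Cheap low-rank network that guesses which FFN neurons matter
        # for the current hidden state (hypothetical stand-in).
        def __init__(self, hidden_dim, ffn_dim, rank=64):
            super().__init__()
            self.proj = nn.Sequential(
                nn.Linear(hidden_dim, rank, bias=False),
                nn.Linear(rank, ffn_dim, bias=False),
            )

        def forward(self, x, k):
            scores = self.proj(x)                  # (batch, ffn_dim)
            return scores.topk(k, dim=-1).indices  # ids of "active" neurons

    def sparse_mlp(x, w_in, w_out, predictor, k):
        # x: (batch, hidden), w_in: (ffn, hidden), w_out: (hidden, ffn)
        idx = predictor(x, k)                      # (batch, k)
        w_in_act = w_in[idx]                       # gather only k input rows
        h = torch.relu(torch.einsum("bh,bkh->bk", x, w_in_act))
        w_out_act = w_out.t()[idx]                 # gather matching output columns
        return torch.einsum("bk,bkh->bh", h, w_out_act)

    # Example: touch only 256 of 4096 FFN neurons (~6% of the weights).
    x = torch.randn(4, 1024)
    w_in, w_out = torch.randn(4096, 1024), torch.randn(1024, 4096)
    pred = MLPSparsityPredictor(1024, 4096)
    y = sparse_mlp(x, w_in, w_out, pred, k=256)

Because the predictor is far cheaper than the dense FFN, gathering k rows and running two small contractions wins wall-clock time whenever k is small relative to the FFN width.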

Quick Start & Requirements

  • Installation: Requires PyTorch 1.12.0+cu113, cupy-cuda11x==11.0.0, and NCCL for CUDA 11.x; the Hugging Face Transformers library is also needed.
  • Prerequisites: CUDA 11.x is mandatory. Docker is recommended for latency benchmarking. (A quick environment sanity check is sketched after this list.)
  • Setup: Involves collecting training data via shell scripts, converting Hugging Face checkpoints, and training sparsity predictors (attention and MLP).
  • Resources: Requires significant disk space for the C4 dataset and model checkpoints (e.g., OPT-175B).
  • Links: Paper: https://proceedings.mlr.press/v202/liu23am.html
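A quick way to verify the pinned environment before running the scripts (a minimal sketch; the version strings come from the list above and may need adjusting for your setup):

    import torch
    import cupy
    import transformers

    assert torch.__version__.startswith("1.12.0"), torch.__version__
    assert torch.version.cuda and torch.version.cuda.startswith("11."), \
        "a CUDA 11.x build of PyTorch is required"
    assert cupy.__version__ == "11.0.0", cupy.__version__
    print("torch", torch.__version__, "| cuda", torch.version.cuda,
          "| cupy", cupy.__version__, "| transformers", transformers.__version__)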

Highlighted Details

  • Reduces OPT-175B generation latency by more than 2x versus FasterTransformer and more than 6x versus the Hugging Face implementation.
  • Maintains model quality and in-context learning abilities.
  • Includes benchmarks for end-to-end accuracy (perplexity, downstream tasks) and generation latency.
  • Utilizes CUDA graphs for optimized latency benchmarking (a generic measurement pattern is sketched below).
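The CUDA-graph timing pattern such benchmarks typically use looks like the following sketch; the function and argument names (bench_cuda_graph, fn, example_input) are illustrative, not the repository's script:

    import torch

    def bench_cuda_graph(fn, example_input, warmup=3, iters=100):
        # Warm up on a side stream so allocations settle before capture.
        s = torch.cuda.Stream()
        s.wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(s):
            for _ in range(warmup):
                fn(example_input)
        torch.cuda.current_stream().wait_stream(s)

        # Record the kernel sequence once, then replay it without
        # per-kernel launch overhead.
        g = torch.cuda.CUDAGraph()
        with torch.cuda.graph(g):
            fn(example_input)

        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(iters):
            g.replay()
        end.record()
        torch.cuda.synchronize()
        return start.elapsed_time(end) / iters  # mean latency in ms

Replaying a captured graph removes Python and kernel-launch overhead from the measurement, which matters when the sparse kernels themselves are short.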

Maintenance & Community

The project is associated with FMInference and has a paper published at ICML 2023. No specific community channels (Discord/Slack) or active maintenance signals are explicitly mentioned in the README.

Licensing & Compatibility

The README does not explicitly state a license. The code is provided for research purposes. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The setup process is complex: it pins older versions of PyTorch and CUDA and involves multiple data-collection and predictor-training steps. Accuracy benchmarks rely on a separate repository (Decentralized_FM_alpha). The latency benchmarking script for sparse-MLP + sparse-attention blocks requires careful configuration of the sparsity thresholds (mlp-K, att-K1, att-K2); a back-of-the-envelope illustration of what these thresholds control follows.
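The K values below are illustrative assumptions, not the repository's defaults; only the OPT-175B layer shapes are the published ones:

    # OPT-175B per-layer shapes: hidden 12288, FFN 4*12288, 96 attention heads.
    hidden_dim, n_heads = 12288, 96
    ffn_dim = 4 * hidden_dim
    mlp_K, att_K1 = 6144, 24           # hypothetical top-k settings

    print(f"FFN neurons kept:     {mlp_K / ffn_dim:.1%}")   # 12.5%
    print(f"Attention heads kept: {att_K1 / n_heads:.1%}")  # 25.0%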

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 16 stars in the last 90 days
