DejaVu by FMInference

Research paper on efficient LLM inference via contextual sparsity

Created 2 years ago
336 stars

Top 81.8% on SourcePulse

Project Summary

DejaVu addresses the high computational cost of Large Language Model (LLM) inference by introducing contextual sparsity, enabling significant speedups without compromising model quality or in-context learning. It targets researchers and engineers working with LLMs who need to reduce inference latency on modern hardware.

How It Works

DejaVu leverages the insight that, for any given input, only a small subset of attention heads and MLP parameters is needed for accurate inference. It employs a low-cost, on-the-fly predictor to identify these "contextual" sparse components at each layer. This avoids costly retraining, preserves the model's capabilities, and yields wall-clock speedups.
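The core mechanism can be pictured as a cheap predictor gating a large MLP. The following is a minimal, illustrative PyTorch sketch, not DejaVu's actual kernels or class names; the predictor here is untrained and its low-rank size is made up:

```python
# Illustrative sketch of contextual sparsity in one MLP block (hypothetical
# names; not DejaVu's implementation). A cheap low-rank predictor scores
# all d_ff neurons from the incoming hidden state, and only the top-k
# predicted neurons are actually computed.
import torch
import torch.nn as nn


class ContextuallySparseMLP(nn.Module):
    def __init__(self, d_model: int, d_ff: int, k: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff)   # up-projection
        self.w2 = nn.Linear(d_ff, d_model)   # down-projection
        # Low-rank predictor: much cheaper than the full up-projection.
        self.predictor = nn.Sequential(
            nn.Linear(d_model, 128), nn.ReLU(), nn.Linear(128, d_ff)
        )
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, d_model). Score neurons for THIS input, keep top-k.
        idx = self.predictor(x).topk(self.k, dim=-1).indices  # (batch, k)
        w1 = self.w1.weight[idx]                 # (batch, k, d_model)
        b1 = self.w1.bias[idx]                   # (batch, k)
        h = torch.relu(torch.einsum("bd,bkd->bk", x, w1) + b1)
        w2 = self.w2.weight.t()[idx]             # (batch, k, d_model)
        return torch.einsum("bk,bkd->bd", h, w2) + self.w2.bias


mlp = ContextuallySparseMLP(d_model=1024, d_ff=4096, k=512)
y = mlp(torch.randn(2, 1024))   # only 512 of 4096 neurons are computed
```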

Quick Start & Requirements

  • Installation: Requires PyTorch 1.12.0+cu113, cupy-cuda11x==11.0.0, NCCL for CUDA 11.x, and the Hugging Face Transformers library.
  • Prerequisites: CUDA 11.x is mandatory. Docker is recommended for latency benchmarking.
  • Setup: Involves collecting training data via shell scripts, converting Hugging Face checkpoints, and training sparsity predictors for both attention and MLP layers (see the predictor-training sketch after this list).
  • Resources: Requires significant disk space for datasets (C4) and model checkpoints (e.g., OPT-175B).
  • Links: Paper: https://proceedings.mlr.press/v202/liu23am.html
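To make the predictor-training step concrete, here is a hedged sketch of the general recipe: record which neurons actually fire during a dense forward pass, then fit a small multi-label classifier on (hidden state, active-mask) pairs. All names, shapes, and data below are illustrative stand-ins; the repository's scripts use their own data formats and training loops:

```python
# Hypothetical sketch of training an MLP sparsity predictor. Collect
# (hidden_state, active-neuron mask) pairs from a dense forward pass,
# then fit a small multi-label classifier on them.
import torch
import torch.nn as nn

d_model, d_ff = 1024, 4096
predictor = nn.Sequential(nn.Linear(d_model, 128), nn.ReLU(),
                          nn.Linear(128, d_ff))
opt = torch.optim.Adam(predictor.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

# Stand-in for recorded data: hidden states plus binary masks marking
# which MLP neurons had nonzero (post-ReLU) activations.
hidden = torch.randn(256, d_model)
active = (torch.randn(256, d_ff) > 1.0).float()

for step in range(100):
    opt.zero_grad()
    loss = loss_fn(predictor(hidden), active)
    loss.backward()
    opt.step()
```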

Highlighted Details

  • Achieves over 2x latency reduction compared to FasterTransformer and over 6x compared to Hugging Face for OPT-175B.
  • Maintains model quality and in-context learning abilities.
  • Includes benchmarks for end-to-end accuracy (perplexity, downstream tasks) and generation latency.
  • Utilizes CUDA graphs to remove kernel-launch overhead from latency benchmarking (see the sketch below).
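CUDA graphs capture a fixed sequence of GPU kernels once and then replay them, eliminating the per-iteration launch overhead that would otherwise pollute latency numbers. A minimal PyTorch sketch of the technique, not the repository's benchmark harness:

```python
# Minimal CUDA-graph latency measurement in PyTorch (illustrative only;
# not the repo's benchmarking script). Requires a CUDA device.
import torch

model = torch.nn.Linear(4096, 4096).cuda().eval()
static_in = torch.randn(1, 4096, device="cuda")

# Warm up on a side stream before capture, as the CUDA-graphs API requires.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        model(static_in)
torch.cuda.current_stream().wait_stream(s)

# Capture one forward pass into a graph, then replay it for timing.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_out = model(static_in)

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
for _ in range(100):
    g.replay()   # re-launches the captured kernels with no Python overhead
end.record()
torch.cuda.synchronize()
print(f"avg latency: {start.elapsed_time(end) / 100:.3f} ms")
```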

Maintenance & Community

The project is associated with FMInference, and the accompanying paper was published at ICML 2023. The README mentions no community channels (Discord/Slack) or other active-maintenance signals.

Licensing & Compatibility

The README does not explicitly state a license. The code is provided for research purposes. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The setup process is complex: it requires specific older versions of PyTorch and CUDA and involves multiple data-collection and predictor-training steps. Accuracy benchmarks depend on a separate repository (Decentralized_FM_alpha). The latency benchmark for sparse MLP + sparse attention blocks requires careful tuning of the sparsity thresholds (mlp-K, att-K1, att-K2); the sketch below illustrates what such top-k knobs control.
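As a hedged illustration (hypothetical values; not the repository's actual flags or script), a top-k threshold of this kind simply bounds how many heads or neurons run on each step, trading latency against quality:

```python
# Illustrative meaning of a top-k sparsity threshold (hypothetical values;
# not the repo's actual flags or script). Smaller k means fewer attention
# heads run per step: faster, but riskier for output quality.
import torch

n_heads, k_heads = 32, 8                  # e.g. an att-K-style setting
head_scores = torch.randn(n_heads)        # in practice, predictor outputs
keep = head_scores.topk(k_heads).indices  # only these heads are computed
print(sorted(keep.tolist()))
```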

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 4 stars in the last 30 days

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Wei-Lin Chiang (Cofounder of LMArena), and 3 more.

Explore Similar Projects

sparseml by neuralmagic — Sparsification toolkit for optimized neural networks. Top 0.1% on SourcePulse; 2k stars. Created 4 years ago; updated 3 months ago. Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), and 8 more.

EAGLE by SafeAILab — Speculative decoding research paper for faster LLM inference. Top 10.6% on SourcePulse; 2k stars. Created 1 year ago; updated 1 week ago.