lolcats by HazyResearch

Transform LLMs into subquadratic models with efficient linearization

Created 1 year ago
251 stars

Top 99.9% on SourcePulse

View on GitHub
1 Expert Loves This Project
Project Summary

LoLCATs is a method for converting existing large language models (LLMs) such as Llama and Mistral into state-of-the-art subquadratic LLMs. It targets researchers and engineers who want to improve inference efficiency and training speed without significant quality degradation. The primary benefit is a substantially faster and more memory-efficient model, obtained by linearizing the attention mechanism.

How It Works

LoLCATs employs a two-stage process. First, "Attention Transfer" replaces the standard softmax attention layers with trainable linear-attention analogs, trained to closely mimic the original softmax outputs. Second, "Low-rank Linearizing" applies low-rank adaptation (LoRA) to correct the approximation error introduced in the first stage and recover model quality. Together, this "Low-rank Linear Conversion via Attention Transfer" (LoLCATs) procedure linearizes the attention mechanism, replacing its quadratic cost in sequence length with a subquadratic one.
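
To make the first stage concrete, below is a minimal, illustrative PyTorch sketch of a linear-attention analog of the kind attention transfer trains. The class, tensor shapes, and loss are assumptions for illustration, not the repository's actual modules, and the non-causal form is shown for clarity; causal language modeling needs cumulative (prefix-sum) variants, which is what the repository's custom kernels target.

    # Illustrative sketch of stage 1 ("attention transfer"); not the LoLCATs API.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class LinearAttention(nn.Module):
        """Drop-in analog of softmax attention built on a trainable feature map."""
        def __init__(self, head_dim: int, feature_dim: int = 64):
            super().__init__()
            # T2R-style learnable feature map: phi(x) = relu(x W).
            self.proj = nn.Linear(head_dim, feature_dim, bias=False)

        def phi(self, x: torch.Tensor) -> torch.Tensor:
            return F.relu(self.proj(x))

        def forward(self, q, k, v):
            # q, k, v: (batch, heads, seq_len, head_dim)
            q, k = self.phi(q), self.phi(k)
            # Associativity: phi(Q) (phi(K)^T V) costs O(n * d * f),
            # versus O(n^2 * d) for softmax(Q K^T) V.
            kv = torch.einsum("bhnf,bhnd->bhfd", k, v)
            z = k.sum(dim=2)                                   # normalizer terms
            out = torch.einsum("bhnf,bhfd->bhnd", q, kv)
            denom = torch.einsum("bhnf,bhf->bhn", q, z).clamp(min=1e-6)
            return out / denom.unsqueeze(-1)

    # Stage 1 trains phi so the linear-attention outputs match the frozen
    # softmax attention's outputs (e.g., an MSE-style transfer loss);
    # stage 2 then applies LoRA to recover remaining quality.
    def attention_transfer_loss(linear_out: torch.Tensor, softmax_out: torch.Tensor):
        return F.mse_loss(linear_out, softmax_out)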

Quick Start & Requirements

  • Primary Install: Use conda env create -f environment.yaml followed by conda activate lolcats-env.
  • Prerequisites: PyTorch with a compatible CUDA version (adjust environment.yaml as needed), Flash Attention 2, a C++ compiler for the custom CUDA kernels, and a Hugging Face token for model downloads (a quick sanity-check snippet follows this list).
  • Resource Footprint: Training subquadratic Llama 3 8B and Mistral 7B models reportedly takes a couple of hours on a single 40GB A100 GPU.
  • Links: the README mentions a Hugging Face Space demo, a paper, and a two-part blog post, but does not provide their URLs.
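
As a quick sanity check of the prerequisites above, a snippet like the following can be run inside the activated environment to confirm the CUDA-enabled PyTorch build, the Flash Attention 2 install, and Hugging Face authentication. It is a generic check, not a script shipped with the repository, and the token value is an illustrative placeholder.

    # Generic prerequisite check (not part of the repository).
    import torch

    print("PyTorch:", torch.__version__)
    print("CUDA available:", torch.cuda.is_available())

    try:
        import flash_attn
        print("Flash Attention:", flash_attn.__version__)
    except ImportError:
        print("Flash Attention 2 is not installed")

    # Gated models such as Llama 3 require an authenticated Hugging Face account.
    from huggingface_hub import login
    login(token="hf_...")  # illustrative placeholder; or set the HF_TOKEN env var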

Highlighted Details

  • Achieves state-of-the-art quality and training efficiency for subquadratic LLMs.
  • Supports linearizing a range of LLMs (e.g., Mistral, Llama 3/3.1) and different attention feature maps (T2R, Hedgehog); see the sketch after this list.
  • Includes optimized CUDA kernels for causal linear attention and fused linear attention with sliding windows.
  • Provides sample checkpoints on Hugging Face and detailed evaluation scripts using LM Evaluation Harness.
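
For intuition on the feature-map options above: the earlier sketch used a ReLU (T2R-style) map, while a Hedgehog-style map uses exponentials of a learned projection and its negation so that features stay positive and can mimic softmax's spiky attention weights. The formulation below is one common reading of the Hedgehog idea, not necessarily the repository's exact parameterization.

    # Illustrative Hedgehog-style feature map (assumed formulation, not the repo's class).
    import torch
    import torch.nn as nn

    class HedgehogFeatureMap(nn.Module):
        """phi(x) = [exp(x W), exp(-x W)]: doubles the feature dimension and keeps
        features positive, helping mimic softmax's low-entropy attention weights."""
        def __init__(self, head_dim: int, feature_dim: int = 64):
            super().__init__()
            self.proj = nn.Linear(head_dim, feature_dim, bias=False)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            h = self.proj(x)
            return torch.cat([torch.exp(h), torch.exp(-h)], dim=-1)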

Maintenance & Community

The repository is published by HazyResearch. The README does not mention maintenance status, notable contributors, sponsorships, or community channels such as Discord or Slack. A lolcats-scaled branch suggests work on extending the method to larger models.

Licensing & Compatibility

The README does not state a software license. Without one, suitability for commercial use or closed-source integration cannot be determined; seek clarification from the maintainers before adopting.

Limitations & Caveats

Compiling the custom CUDA kernels may require carefully matching the system CUDA toolkit and C++ compiler versions. Debugging Hugging Face datasets errors may require pinning specific package versions (e.g., datasets==2.15.0). The absence of a stated license is a significant adoption blocker.
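
One way to catch the kernel-compilation pitfall early is to compare the CUDA toolkit that PyTorch was built against with the nvcc on the system path before building. This is a generic diagnostic, not a script provided by the repository.

    # Generic CUDA toolchain check before compiling custom kernels.
    import subprocess
    import torch

    print("PyTorch built against CUDA:", torch.version.cuda)
    try:
        result = subprocess.run(["nvcc", "--version"], capture_output=True, text=True)
        for line in result.stdout.splitlines():
            if "release" in line:
                print("System nvcc:", line.strip())
    except FileNotFoundError:
        print("nvcc not found on PATH; install a CUDA toolkit matching torch.version.cuda")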

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 2 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral), and 15 more.

LoRA by microsoft

0.2% · 13k stars
PyTorch library for low-rank adaptation (LoRA) of LLMs
Created 4 years ago · Updated 1 year ago