lolcats by HazyResearch

Transform LLMs into subquadratic models with efficient linearization

Created 1 year ago
251 stars

Top 99.9% on SourcePulse

View on GitHub
1 Expert Loves This Project
Project Summary

LoLCATs is a method for converting existing large language models (LLMs) such as Llama and Mistral into state-of-the-art subquadratic LLMs. It targets researchers and engineers who want to improve inference efficiency and training speed without significant quality degradation. The primary benefit is a substantially faster and more memory-efficient model, obtained by linearizing the attention mechanism.

How It Works

LoLCATs employs a two-stage process. First, "Attention Transfer" replaces the standard softmax attention layers with trainable linear-attention analogs, trained to closely mimic the original softmax outputs. Second, "Low-rank Linearizing" applies low-rank adaptation (LoRA) to correct the approximation error introduced in the first stage and recover model quality. Together, this "Low-rank Linear Conversion via Attention Transfer" (LoLCATs) procedure linearizes the attention mechanism, replacing its quadratic cost in sequence length with a subquadratic one.
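
To make the first stage concrete, below is a minimal, illustrative PyTorch sketch of a linear-attention analog of the kind attention transfer trains. The class, tensor shapes, and loss are assumptions for illustration, not the repository's actual modules, and the non-causal form is shown for clarity; causal language modeling needs cumulative (prefix-sum) variants, which is what the repository's custom kernels target.

    # Illustrative sketch of stage 1 ("attention transfer"); not the LoLCATs API.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class LinearAttention(nn.Module):
        """Drop-in analog of softmax attention built on a trainable feature map."""
        def __init__(self, head_dim: int, feature_dim: int = 64):
            super().__init__()
            # T2R-style learnable feature map: phi(x) = relu(x W).
            self.proj = nn.Linear(head_dim, feature_dim, bias=False)

        def phi(self, x: torch.Tensor) -> torch.Tensor:
            return F.relu(self.proj(x))

        def forward(self, q, k, v):
            # q, k, v: (batch, heads, seq_len, head_dim)
            q, k = self.phi(q), self.phi(k)
            # Associativity: phi(Q) (phi(K)^T V) costs O(n * d * f),
            # versus O(n^2 * d) for softmax(Q K^T) V.
            kv = torch.einsum("bhnf,bhnd->bhfd", k, v)
            z = k.sum(dim=2)                                   # normalizer terms
            out = torch.einsum("bhnf,bhfd->bhnd", q, kv)
            denom = torch.einsum("bhnf,bhf->bhn", q, z).clamp(min=1e-6)
            return out / denom.unsqueeze(-1)

    # Stage 1 trains phi so the linear-attention outputs match the frozen
    # softmax attention's outputs (e.g., an MSE-style transfer loss);
    # stage 2 then applies LoRA to recover remaining quality.
    def attention_transfer_loss(linear_out: torch.Tensor, softmax_out: torch.Tensor):
        return F.mse_loss(linear_out, softmax_out)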

Quick Start & Requirements

  • Primary Install: Use conda env create -f environment.yaml followed by conda activate lolcats-env.
  • Prerequisites: PyTorch with a compatible CUDA version (adjust environment.yaml as needed), Flash Attention 2, a C++ compiler for the custom CUDA kernels, and a Hugging Face token for model downloads (a quick sanity-check snippet follows this list).
  • Resource Footprint: Training subquadratic Llama 3 8B and Mistral 7B models reportedly takes a couple of hours on a single 40GB A100 GPU.
  • Links: the README mentions a Hugging Face Space demo, a paper, and a two-part blog post, but does not provide their URLs.
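
As a quick sanity check of the prerequisites above, a snippet like the following can be run inside the activated environment to confirm the CUDA-enabled PyTorch build, the Flash Attention 2 install, and Hugging Face authentication. It is a generic check, not a script shipped with the repository, and the token value is an illustrative placeholder.

    # Generic prerequisite check (not part of the repository).
    import torch

    print("PyTorch:", torch.__version__)
    print("CUDA available:", torch.cuda.is_available())

    try:
        import flash_attn
        print("Flash Attention:", flash_attn.__version__)
    except ImportError:
        print("Flash Attention 2 is not installed")

    # Gated models such as Llama 3 require an authenticated Hugging Face account.
    from huggingface_hub import login
    login(token="hf_...")  # illustrative placeholder; or set the HF_TOKEN env var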

Highlighted Details

  • Achieves state-of-the-art quality and training efficiency for subquadratic LLMs.
  • Supports linearizing a range of LLMs (e.g., Mistral, Llama 3/3.1) and different attention feature maps (T2R, Hedgehog); see the sketch after this list.
  • Includes optimized CUDA kernels for causal linear attention and fused linear attention with sliding windows.
  • Provides sample checkpoints on Hugging Face and detailed evaluation scripts using LM Evaluation Harness.
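
For intuition on the feature-map options above: the earlier sketch used a ReLU (T2R-style) map, while a Hedgehog-style map uses exponentials of a learned projection and its negation so that features stay positive and can mimic softmax's spiky attention weights. The formulation below is one common reading of the Hedgehog idea, not necessarily the repository's exact parameterization.

    # Illustrative Hedgehog-style feature map (assumed formulation, not the repo's class).
    import torch
    import torch.nn as nn

    class HedgehogFeatureMap(nn.Module):
        """phi(x) = [exp(x W), exp(-x W)]: doubles the feature dimension and keeps
        features positive, helping mimic softmax's low-entropy attention weights."""
        def __init__(self, head_dim: int, feature_dim: int = 64):
            super().__init__()
            self.proj = nn.Linear(head_dim, feature_dim, bias=False)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            h = self.proj(x)
            return torch.cat([torch.exp(h), torch.exp(-h)], dim=-1)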

Maintenance & Community

The repository is published by HazyResearch. The README does not mention maintenance status, notable contributors, sponsorships, or community channels such as Discord or Slack. A lolcats-scaled branch suggests work on extending the method to larger models.

Licensing & Compatibility

The README does not state a software license. Without one, suitability for commercial use or closed-source integration cannot be determined; seek clarification from the maintainers before adopting.

Limitations & Caveats

Compiling the custom CUDA kernels may require carefully matching the system CUDA toolkit and C++ compiler versions. Debugging Hugging Face datasets errors may require pinning specific package versions (e.g., datasets==2.15.0). The absence of a stated license is a significant adoption blocker.
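
One way to catch the kernel-compilation pitfall early is to compare the CUDA toolkit that PyTorch was built against with the nvcc on the system path before building. This is a generic diagnostic, not a script provided by the repository.

    # Generic CUDA toolchain check before compiling custom kernels.
    import subprocess
    import torch

    print("PyTorch built against CUDA:", torch.version.cuda)
    try:
        result = subprocess.run(["nvcc", "--version"], capture_output=True, text=True)
        for line in result.stdout.splitlines():
            if "release" in line:
                print("System nvcc:", line.strip())
    except FileNotFoundError:
        print("nvcc not found on PATH; install a CUDA toolkit matching torch.version.cuda")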

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 2 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral), and 15 more.

LoRA by microsoft

0.2% · 13k stars
PyTorch library for low-rank adaptation (LoRA) of LLMs
Created 4 years ago · Updated 1 year ago