LLM-Shearing by princeton-nlp

Code for LLM pre-training acceleration via structured pruning (ICLR 2024)

Created 1 year ago
631 stars

Top 52.5% on SourcePulse

Project Summary

This repository provides the codebase for Sheared-LLaMA, a structured pruning technique that significantly accelerates language model pre-training by creating smaller, performant models from larger ones. It targets researchers and practitioners aiming to develop efficient, smaller-scale LLMs without the prohibitive cost of training from scratch.

How It Works

Sheared-LLaMA builds on MosaicML's Composer package, implementing both pruning and dynamic data loading as Composer callbacks. The core idea is to prune an existing large model (such as LLaMA-2) down to a smaller target architecture and then continue pre-training it, reaching performance comparable to same-size models trained from scratch at a fraction of the cost. Integrating pruning directly into the training loop lets the pruning masks be learned efficiently while the model is compressed.
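As a rough picture of the callback pattern, the sketch below shows how pruning logic can hook into Composer's training events. It is illustrative only: the `PruningCallback` class and the `apply_masks()` hook are hypothetical names, not the repository's actual API.

```python
# Illustrative sketch of the Composer callback pattern, not the repo's classes.
# The `apply_masks()` hook on the model is hypothetical.
from composer.core import Callback, State
from composer.loggers import Logger


class PruningCallback(Callback):
    """Hypothetical callback that refreshes pruning masks after each batch."""

    def __init__(self, target_sparsity: float):
        self.target_sparsity = target_sparsity

    def batch_end(self, state: State, logger: Logger) -> None:
        model = state.model
        # In a mask-learning setup, the masks over heads, hidden dimensions,
        # and layers would be updated alongside the weights each step.
        if hasattr(model, "apply_masks"):
            model.apply_masks()  # hypothetical hook
        logger.log_metrics({"pruning/target_sparsity": self.target_sparsity})
```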

Quick Start & Requirements

  • Install: after installing PyTorch 2.0.1 with CUDA 11.8 and Flash Attention 1.0.3, run pip install -r requirement.txt followed by pip install -e .
  • Prerequisites: Python 3.x, PyTorch (2.0.1+cu118), Flash Attention (1.0.3.post); Flash Attention v2 is not supported.
  • Setup: requires converting model weights to Composer format before training or pruning; the released checkpoints on Hugging Face load directly with transformers (see the sketch after this list).
  • Links: arXiv preprint, blog post
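If you only want to use the released models rather than run the pruning pipeline, they load as standard Hugging Face checkpoints. A minimal usage sketch, assuming the princeton-nlp/Sheared-LLaMA-1.3B model ID on the Hub:

```python
# Minimal usage sketch; assumes the released checkpoint is available on the
# Hugging Face Hub under princeton-nlp/Sheared-LLaMA-1.3B.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "princeton-nlp/Sheared-LLaMA-1.3B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Structured pruning can", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```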

Highlighted Details

  • The pruned 1.3B/2.7B models match or outperform open-source models of comparable size (e.g., Pythia, INCITE, OpenLLaMA) while requiring roughly 3% of the compute of training such models from scratch.
  • Releases pre-trained models (Sheared-LLaMA-1.3B, Sheared-LLaMA-2.7B) along with instruction-tuned variants.
  • Supports pruning to custom target model shapes (hidden dimension, layer count, head count; see the shape sketch after this list).
  • Includes dynamic data loading for adaptive domain weighting during training (see the weighting sketch after this list).
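A target shape is essentially a handful of architecture hyperparameters. The snippet below is only an illustration of such a spec: the key names are hypothetical and the values approximate a roughly 1.3B-parameter LLaMA-style model, not necessarily the repo's exact configuration format.

```python
# Hypothetical target-shape spec; key names and values are illustrative of a
# ~1.3B-parameter LLaMA-style model, not the repo's config format.
target_shape = {
    "n_layers": 24,          # transformer blocks to keep
    "d_model": 2048,         # hidden dimension
    "n_heads": 16,           # attention heads per layer
    "d_intermediate": 5504,  # feed-forward (MLP) width
}
```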
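The dynamic data loading idea from the paper re-weights training domains on the fly: domains whose current loss sits furthest above a reference loss receive a larger share of the next batches. A minimal sketch of that update rule, with illustrative function and variable names:

```python
# Illustrative sketch of dynamic batch loading: shift sampling weight toward
# domains whose current loss exceeds its reference loss.
import numpy as np


def update_domain_weights(weights, current_losses, reference_losses, step_size=1.0):
    weights = np.asarray(weights, dtype=float)
    excess = np.maximum(
        np.asarray(current_losses, dtype=float) - np.asarray(reference_losses, dtype=float),
        0.0,
    )
    new_weights = weights * np.exp(step_size * excess)  # exponential re-weighting
    return new_weights / new_weights.sum()              # renormalize to a distribution


# Three hypothetical domains (e.g., web, code, books):
print(update_domain_weights(
    weights=[0.5, 0.3, 0.2],
    current_losses=[2.1, 1.4, 1.9],
    reference_losses=[2.0, 1.5, 1.7],
))
```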

Maintenance & Community

  • Developed by Princeton University researchers.
  • Main development and releases took place in late 2023.
  • Issues can be opened on GitHub; contact: mengzhou@princeton.edu.

Licensing & Compatibility

  • The repository itself does not explicitly state a license. The underlying models (LLaMA) have specific usage terms.

Limitations & Caveats

  • Flash Attention v2 is not supported and may require manual modifications.
  • Dynamic data loading is limited to local data and single-worker dataloaders without prefetching (see the dataloader sketch after this list).
  • autoresume compatibility is not guaranteed for the pruning stage.
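As a concrete illustration of the dataloader constraint above, the dynamic loader implies an in-process, single-worker PyTorch DataLoader. A minimal sketch with a stand-in dataset:

```python
# Sketch of the single-worker, no-prefetch dataloader setup implied above;
# the TensorDataset is a stand-in for locally stored tokenized data.
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.arange(1024).reshape(-1, 8))
loader = DataLoader(dataset, batch_size=4, num_workers=0)  # single worker, no prefetching

for (batch,) in loader:
    pass  # a training loop would consume batches here
```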
Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 2 stars in the last 30 days

Explore Similar Projects

Starred by Wing Lian (Founder of Axolotl AI) and Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake).

fms-fsdp by foundation-model-stack
0.4% · 265 stars
Efficiently train foundation models with PyTorch
Created 1 year ago · Updated 1 month ago

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Wing Lian (Founder of Axolotl AI), and 2 more.

sparsegpt by IST-DASLab
0.5% · 836 stars
Code for massive language model one-shot pruning (ICML 2023 paper)
Created 2 years ago · Updated 1 year ago

Starred by Jared Palmer (Ex-VP AI at Vercel; Founder of Turborepo; Author of Formik, TSDX).

wanda by locuslab
0.4% · 802 stars
LLM pruning research paper implementation
Created 2 years ago · Updated 1 year ago