LLM-Shearing by princeton-nlp

Code for LLM pre-training acceleration via structured pruning (ICLR 2024)

Created 1 year ago
631 stars

Top 52.5% on SourcePulse

Project Summary

This repository provides the codebase for Sheared-LLaMA, a structured pruning technique that significantly accelerates language model pre-training by creating smaller, performant models from larger ones. It targets researchers and practitioners aiming to develop efficient, smaller-scale LLMs without the prohibitive cost of training from scratch.

How It Works

Sheared-LLaMA builds on MosaicML's Composer package, implementing both pruning and dynamic data loading as Composer callbacks. The core idea is to prune an existing large model (such as LLaMA-2) down to a smaller target architecture and then continue pre-training it, reaching performance comparable to same-size models trained from scratch at a fraction of the cost. Integrating pruning directly into the training loop lets the pruning masks be learned efficiently while the model is compressed.
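As a rough picture of the callback pattern, the sketch below shows how pruning logic can hook into Composer's training events. It is illustrative only: the `PruningCallback` class and the `apply_masks()` hook are hypothetical names, not the repository's actual API.

```python
# Illustrative sketch of the Composer callback pattern, not the repo's classes.
# The `apply_masks()` hook on the model is hypothetical.
from composer.core import Callback, State
from composer.loggers import Logger


class PruningCallback(Callback):
    """Hypothetical callback that refreshes pruning masks after each batch."""

    def __init__(self, target_sparsity: float):
        self.target_sparsity = target_sparsity

    def batch_end(self, state: State, logger: Logger) -> None:
        model = state.model
        # In a mask-learning setup, the masks over heads, hidden dimensions,
        # and layers would be updated alongside the weights each step.
        if hasattr(model, "apply_masks"):
            model.apply_masks()  # hypothetical hook
        logger.log_metrics({"pruning/target_sparsity": self.target_sparsity})
```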

Quick Start & Requirements

  • Install: after installing PyTorch 2.0.1 with CUDA 11.8 and Flash Attention 1.0.3, run pip install -r requirement.txt followed by pip install -e .
  • Prerequisites: Python 3.x, PyTorch (2.0.1+cu118), Flash Attention (1.0.3.post); Flash Attention v2 is not supported.
  • Setup: requires converting model weights to Composer format before training or pruning; the released checkpoints on Hugging Face load directly with transformers (see the sketch after this list).
  • Links: arXiv preprint, blog post
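If you only want to use the released models rather than run the pruning pipeline, they load as standard Hugging Face checkpoints. A minimal usage sketch, assuming the princeton-nlp/Sheared-LLaMA-1.3B model ID on the Hub:

```python
# Minimal usage sketch; assumes the released checkpoint is available on the
# Hugging Face Hub under princeton-nlp/Sheared-LLaMA-1.3B.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "princeton-nlp/Sheared-LLaMA-1.3B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Structured pruning can", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```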

Highlighted Details

  • The pruned 1.3B/2.7B models match or outperform open-source models of comparable size (e.g., Pythia, INCITE, OpenLLaMA) while requiring roughly 3% of the compute of training such models from scratch.
  • Releases pre-trained models (Sheared-LLaMA-1.3B, Sheared-LLaMA-2.7B) along with instruction-tuned variants.
  • Supports pruning to custom target model shapes (hidden dimension, layer count, head count; see the shape sketch after this list).
  • Includes dynamic data loading for adaptive domain weighting during training (see the weighting sketch after this list).
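A target shape is essentially a handful of architecture hyperparameters. The snippet below is only an illustration of such a spec: the key names are hypothetical and the values approximate a roughly 1.3B-parameter LLaMA-style model, not necessarily the repo's exact configuration format.

```python
# Hypothetical target-shape spec; key names and values are illustrative of a
# ~1.3B-parameter LLaMA-style model, not the repo's config format.
target_shape = {
    "n_layers": 24,          # transformer blocks to keep
    "d_model": 2048,         # hidden dimension
    "n_heads": 16,           # attention heads per layer
    "d_intermediate": 5504,  # feed-forward (MLP) width
}
```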
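The dynamic data loading idea from the paper re-weights training domains on the fly: domains whose current loss sits furthest above a reference loss receive a larger share of the next batches. A minimal sketch of that update rule, with illustrative function and variable names:

```python
# Illustrative sketch of dynamic batch loading: shift sampling weight toward
# domains whose current loss exceeds its reference loss.
import numpy as np


def update_domain_weights(weights, current_losses, reference_losses, step_size=1.0):
    weights = np.asarray(weights, dtype=float)
    excess = np.maximum(
        np.asarray(current_losses, dtype=float) - np.asarray(reference_losses, dtype=float),
        0.0,
    )
    new_weights = weights * np.exp(step_size * excess)  # exponential re-weighting
    return new_weights / new_weights.sum()              # renormalize to a distribution


# Three hypothetical domains (e.g., web, code, books):
print(update_domain_weights(
    weights=[0.5, 0.3, 0.2],
    current_losses=[2.1, 1.4, 1.9],
    reference_losses=[2.0, 1.5, 1.7],
))
```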

Maintenance & Community

  • Developed by Princeton University researchers.
  • Main development and releases took place in late 2023.
  • Issues can be opened on GitHub; contact: mengzhou@princeton.edu.

Licensing & Compatibility

  • The repository itself does not explicitly state a license. The underlying models (LLaMA) have specific usage terms.

Limitations & Caveats

  • Flash Attention v2 is not supported and may require manual modifications.
  • Dynamic data loading is limited to local data and single-worker dataloaders without prefetching (see the dataloader sketch after this list).
  • autoresume compatibility is not guaranteed for the pruning stage.
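As a concrete illustration of the dataloader constraint above, the dynamic loader implies an in-process, single-worker PyTorch DataLoader. A minimal sketch with a stand-in dataset:

```python
# Sketch of the single-worker, no-prefetch dataloader setup implied above;
# the TensorDataset is a stand-in for locally stored tokenized data.
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.arange(1024).reshape(-1, 8))
loader = DataLoader(dataset, batch_size=4, num_workers=0)  # single worker, no prefetching

for (batch,) in loader:
    pass  # a training loop would consume batches here
```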
Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 2 stars in the last 30 days

Explore Similar Projects

Starred by Wing Lian (Founder of Axolotl AI) and Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake).

fms-fsdp by foundation-model-stack
0.4% · 265 stars
Efficiently train foundation models with PyTorch
Created 1 year ago · Updated 1 month ago

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Wing Lian (Founder of Axolotl AI), and 2 more.

sparsegpt by IST-DASLab
0.5% · 836 stars
Code for massive language model one-shot pruning (ICML 2023 paper)
Created 2 years ago · Updated 1 year ago

Starred by Jared Palmer (Ex-VP AI at Vercel; Founder of Turborepo; Author of Formik, TSDX).

wanda by locuslab
0.4% · 802 stars
LLM pruning research paper implementation
Created 2 years ago · Updated 1 year ago