LLM-Shearing by princeton-nlp

Code for LLM pre-training acceleration via structured pruning (ICLR 2024)

created 1 year ago
626 stars

Top 53.7% on sourcepulse

Project Summary

This repository provides the codebase for Sheared-LLaMA, a structured pruning technique that significantly accelerates language model pre-training by creating smaller, performant models from larger ones. It targets researchers and practitioners aiming to develop efficient, smaller-scale LLMs without the prohibitive cost of training from scratch.

How It Works

Sheared-LLaMA builds on MosaicML's Composer package, implementing both pruning and dynamic data loading as Composer callbacks. The core idea is to prune an existing large model (such as LLaMA-2) down to a target smaller architecture and then continue pre-training it, reaching performance comparable to models trained from scratch at a fraction of the cost. Pruning is integrated directly into the training loop, so pruning masks are learned jointly with the model weights, while dynamic data loading adjusts the mixture of continued pre-training data on the fly.
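
The sketch below shows how a pruning-style callback could plug into a Composer training loop; it is illustrative only. PruningMaskCallback, the logged metric name, and the commented-out Trainer arguments are hypothetical placeholders rather than the repository's actual classes, although the Callback/Trainer wiring follows Composer's public API.

    # Illustrative sketch: hooking a pruning-style callback into a Composer loop.
    # PruningMaskCallback is a hypothetical placeholder, not the repo's callback.
    from composer import Callback, Logger, State, Trainer

    class PruningMaskCallback(Callback):
        """Placeholder for mask learning that runs alongside each training step."""

        def batch_end(self, state: State, logger: Logger) -> None:
            # The real callback would update pruning masks here; this only logs the step.
            logger.log_metrics({"pruning/step": state.timestamp.batch.value})

    # trainer = Trainer(model=composer_model, train_dataloader=train_loader,
    #                   max_duration="3200ba", callbacks=[PruningMaskCallback()])
    # trainer.fit()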

Quick Start & Requirements

  • Install: run pip install -r requirement.txt followed by pip install -e ., after first installing PyTorch built for CUDA 11.8 and Flash Attention 1.0.3.
  • Prerequisites: PyTorch 2.0.1+cu118, Flash Attention 1.0.3.post, Python 3.x. Flash Attention v2 is not supported (a quick environment check is sketched after this list).
  • Setup: model weights must be converted to a Composer-compatible format before pruning.
  • Links: ArXiv Preprint, Blog Post
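
The versions above matter; as a sanity check, something like the following can confirm the environment before launching training. This is a minimal sketch, assuming both packages expose a __version__ attribute, not a script from the repository.

    # Check installed versions against the stated prerequisites:
    # PyTorch 2.0.1 built for CUDA 11.8 and Flash Attention 1.x (v2 is unsupported).
    import torch
    import flash_attn

    print("torch:", torch.__version__)        # expected: 2.0.1+cu118
    print("cuda:", torch.version.cuda)        # expected: 11.8
    print("flash_attn:", flash_attn.__version__)

    assert torch.version.cuda == "11.8", "expected a CUDA 11.8 build of PyTorch"
    assert flash_attn.__version__.startswith("1."), "Flash Attention v2 is not supported"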

Highlighted Details

  • Achieves a model as strong as OpenLLaMA-7B with 3% of the pre-training cost of LLaMA-2-7B.
  • Offers pre-trained and instruction-tuned models: Sheared-LLaMA-1.3B, Sheared-LLaMA-2.7B, and their instruction-tuned variants.
  • Supports pruning to custom target model shapes (hidden dimensions, layers, heads); see the shape sketch after this list.
  • Includes dynamic data loading capabilities for adaptive training.
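
To make the custom-shape option concrete, the snippet below spells out a target architecture as a plain dictionary. The key names are hypothetical (the repository drives pruning through YAML configs), and the values approximate the published Sheared-LLaMA-1.3B shape.

    # Hypothetical target-shape spec; key names are illustrative, values roughly
    # match the released Sheared-LLaMA-1.3B architecture.
    target_model_shape = {
        "d_model": 2048,            # hidden dimension
        "n_layers": 24,             # transformer layers
        "n_heads": 16,              # attention heads
        "intermediate_size": 5504,  # feed-forward width
    }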

Maintenance & Community

  • Developed by Princeton University researchers.
  • Development was most active in late 2023, when the main code and model releases landed.
  • Issues can be opened on GitHub; contact: mengzhou@princeton.edu.

Licensing & Compatibility

  • The repository itself does not explicitly state a license. The underlying models (LLaMA) have specific usage terms.

Limitations & Caveats

  • Flash Attention v2 is not supported; using it would require manual code modifications.
  • Dynamic data loading is limited to local data and single-worker dataloaders without prefetching (a conforming DataLoader setup is sketched after this list).
  • autoresume compatibility is not guaranteed for the pruning stage.
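
For reference, a dataloader set up to respect the dynamic-data-loading constraint would look roughly like this; the dataset and batch size are placeholders, not values from the repository.

    # Illustrative only: a PyTorch DataLoader with a single worker and no prefetching,
    # matching the dynamic data loading caveat above.
    import torch
    from torch.utils.data import DataLoader, TensorDataset

    # Placeholder for locally stored pre-training data.
    local_dataset = TensorDataset(torch.zeros(16, 8, dtype=torch.long))

    loader = DataLoader(
        local_dataset,
        batch_size=8,
        num_workers=0,  # single-worker loading; prefetch_factor is left unset
    )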

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 20 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Jeff Hammerbacher (Cofounder of Cloudera), and 3 more.

LLaMA-Adapter by OpenGVLab

6k stars
Efficient fine-tuning for instruction-following LLaMA models
created 2 years ago
updated 1 year ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Elie Bursztein (Cybersecurity Lead at Google DeepMind), and 10 more.

open_llama by openlm-research

8k stars
Open-source reproduction of LLaMA models
created 2 years ago
updated 2 years ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Ying Sheng (Author of SGLang), and 9 more.

alpaca-lora by tloen

19k stars
LoRA fine-tuning for LLaMA
created 2 years ago
updated 1 year ago