simplified_transformers by bobby-he

Research paper implementation for simplifying transformer blocks

created 1 year ago
292 stars

Top 91.4% on sourcepulse

View on GitHub
1 Expert Loves This Project
Project Summary

This repository provides implementations for research on simplifying Transformer blocks and on understanding and minimising outlier features in Transformer training, targeting researchers and practitioners in deep learning. It offers code to reproduce the experiments from two papers, published at ICLR 2024 and NeurIPS 2024, enabling exploration of novel Transformer architectures and training techniques.

How It Works

The codebase focuses on autoregressive language modeling, specifically next-token prediction using GPT-2. It implements simplified Transformer block variants, including parallel and "skipless" configurations, which deviate from standard attention and feed-forward sub-layer structures. These modifications aim to reduce computational complexity and improve training stability by addressing outlier features, a key contribution of the associated research.
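
As a rough illustration of these block variants, the PyTorch sketch below shows how a "parallel" block sums its attention and MLP branches rather than applying them in sequence, and how a "skipless" variant drops the residual connection. This is not the repository's implementation: the class and argument names are hypothetical, and the actual variants involve further modifications needed to train well without skip connections.

```python
import torch
import torch.nn as nn

class SimplifiedBlockSketch(nn.Module):
    # Hypothetical class name; an illustration, not the repository's code.
    def __init__(self, d_model: int, n_heads: int,
                 parallel: bool = True, use_skip: bool = False):
        super().__init__()
        self.parallel, self.use_skip = parallel, use_skip
        self.norm_attn = nn.LayerNorm(d_model)
        self.norm_mlp = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model),
                                 nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Causal masking is omitted for brevity; a real LM block would pass attn_mask.
        if self.parallel:
            # Parallel block: the attention and MLP branches read the same input
            # and their outputs are summed, instead of running in sequence.
            h = self.norm_attn(x)
            attn_out, _ = self.attn(h, h, h, need_weights=False)
            branch_sum = attn_out + self.mlp(self.norm_mlp(x))
            # A "skipless" variant drops the residual connection entirely.
            return x + branch_sum if self.use_skip else branch_sum
        # Standard sequential pre-LN block, for comparison.
        h = self.norm_attn(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        return x + self.mlp(self.norm_mlp(x))

# Example usage with GPT-2-small-like dimensions.
block = SimplifiedBlockSketch(d_model=768, n_heads=12, parallel=True, use_skip=False)
y = block(torch.randn(2, 16, 768))  # (batch, sequence, d_model)
```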

Quick Start & Requirements

  • Install dependencies: pip install -r requirements.txt
  • Primary script: python run_clm.py (see the example below)
  • Prerequisites: hydra, wandb, torch, transformers, datasets, evaluate, accelerate.
  • Hardware: a single GPU is recommended; an A100 or A5000 completes a training run in roughly 10 hours.
  • Data: downloaded automatically on first run.
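
A minimal session, as a sketch: only pip install -r requirements.txt, run_clm.py, and the use_wandb=False override are documented above; any other Hydra overrides would depend on the repository's config files.

```bash
# Install dependencies, then launch the default training run.
pip install -r requirements.txt

# Hydra arguments are overridden as key=value on the command line;
# use_wandb=False disables the default wandb logging.
python run_clm.py use_wandb=False
```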

Highlighted Details

  • Reproduces experiments from "Simplifying Transformer Blocks" (ICLR 2024) and "Understanding and Minimising Outlier Features in Transformer Training" (NeurIPS 2024).
  • Implements custom Transformer block variants: default-parallel, skipless, and skipless-parallel.
  • Configurable via Hydra for easy argument modification from the command line.
  • Uses wandb for logging by default; can be disabled with use_wandb=False.

Maintenance & Community

The project is associated with authors Bobby He and Thomas Hofmann. Further community engagement channels are not specified in the README.

Licensing & Compatibility

The README does not explicitly state a license. Compatibility for commercial or closed-source use is not specified.

Limitations & Caveats

The codebase is primarily intended for reproducing the research experiments and may require adaptation for general-purpose use. The kurtosis computation for outlier features uses the variance of squared activations, which differs from standard kurtosis by an additive constant of 1; this does not affect the reported findings.
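
To make the additive-constant relationship concrete, here is a small numerical check. It is an illustration under assumed definitions, not the repository's code: treating kurtosis as E[x^4]/E[x^2]^2 for zero-mean activations and normalising the variance of squared activations by E[x^2]^2, the two quantities differ by exactly 1.

```python
import numpy as np

# Sanity check of the additive-constant claim, under assumed definitions:
#   kurt(x)        = E[x^4] / E[x^2]^2          (zero-mean activations)
#   norm_var_sq(x) = Var(x^2) / E[x^2]^2
# Since Var(x^2) = E[x^4] - E[x^2]^2, it follows that norm_var_sq(x) = kurt(x) - 1.
rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)                # stand-in for a layer's activations

kurt = np.mean(x**4) / np.mean(x**2) ** 2         # ~3.0 for Gaussian data
norm_var_sq = np.var(x**2) / np.mean(x**2) ** 2   # ~2.0 for Gaussian data

print(kurt - norm_var_sq)                         # ~1.0, regardless of the distribution
```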

Health Check

  • Last commit: 7 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

3 stars in the last 90 days

Explore Similar Projects

Starred by Tri Dao (Chief Scientist at Together AI), Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake), and 1 more.

oslo by tunib-ai

0%
309
Framework for large-scale transformer optimization
created 3 years ago
updated 2 years ago
Starred by Jeremy Howard (Cofounder of fast.ai) and Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake).

SwissArmyTransformer by THUDM

0.3%
1k
Transformer library for flexible model development
created 3 years ago
updated 7 months ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Jeff Hammerbacher (Cofounder of Cloudera), and 6 more.

x-transformers by lucidrains

0.2%
5k
Transformer library with extensive experimental features
created 4 years ago
updated 3 days ago