Research paper implementation for simplifying transformer blocks
This repository provides implementations for research on simplifying Transformer blocks and on understanding and minimizing outlier features in Transformer training, targeting researchers and practitioners in deep learning. It offers code to reproduce the experiments from two papers (published at ICLR 2024 and NeurIPS 2024), enabling exploration of novel Transformer architectures and training techniques.
How It Works
The codebase focuses on autoregressive language modeling, specifically next-token prediction using GPT-2. It implements simplified Transformer block variants, including parallel and "skipless" configurations, which deviate from standard attention and feed-forward sub-layer structures. These modifications aim to reduce computational complexity and improve training stability by addressing outlier features, a key contribution of the associated research.
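As a rough sketch of the structural idea (this is not the repository's actual implementation; the class name, layer choices, and shapes below are assumptions), a parallel, skipless block can feed the same normalized input to both attention and the MLP and sum their outputs without an identity residual branch:

```python
# Illustrative sketch only; not the repository's code.
import torch
import torch.nn as nn

class ParallelSkiplessBlock(nn.Module):
    """Hypothetical parallel, skipless Transformer block."""

    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x: torch.Tensor, attn_mask=None) -> torch.Tensor:
        # "Parallel": attention and MLP both read the same normalized input.
        h = self.norm(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        # "Skipless": sub-layer outputs are combined without the usual x + ... residual.
        return attn_out + self.mlp(h)
```

The sketch only shows the structural departure from a standard pre-norm block; the actual variants in the repository may differ in normalization placement, scaling, and other details omitted here.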
Quick Start & Requirements
pip install -r requirements.txt
python run_clm.py
Key dependencies: hydra, wandb, torch, transformers, datasets, evaluate, and accelerate.
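Since hydra is used for configuration, run options can presumably be overridden on the command line; for example, a run with logging disabled might look like the following (the override name use_wandb=False comes from the notes below, but the exact invocation is an assumption):
python run_clm.py use_wandb=False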
Highlighted Details
The codebase uses wandb for logging by default; this can be disabled with use_wandb=False.
Maintenance & Community
The project is associated with authors Bobby He and Thomas Hofmann. Further community engagement channels are not specified in the README.
Licensing & Compatibility
The README does not explicitly state a license. Compatibility for commercial or closed-source use is not specified.
Limitations & Caveats
The codebase is primarily intended for reproducing the papers' experiments and may require adaptation for general-purpose use. The kurtosis computation for outlier features uses the variance of squared activations, which differs from standard kurtosis by an additive constant of 1; this does not affect the reported findings.
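As a minimal sketch of that distinction (assuming standardized activations in a 1-D tensor; this is not the repository's code):

```python
import torch

def variance_of_squares(x: torch.Tensor) -> torch.Tensor:
    # Metric as described in the caveat: variance of the squared, standardized activations.
    z = (x - x.mean()) / x.std(unbiased=False)
    return (z ** 2).var(unbiased=False)  # equals standard kurtosis minus 1

def standard_kurtosis(x: torch.Tensor) -> torch.Tensor:
    # Standard kurtosis: mean fourth power of the standardized activations.
    z = (x - x.mean()) / x.std(unbiased=False)
    return (z ** 4).mean()
```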