simplified_transformers by bobby-he

Research paper implementation for simplifying transformer blocks

Created 1 year ago
293 stars

Top 90.1% on SourcePulse

View on GitHub
1 Expert Loves This Project
Project Summary

This repository provides implementations for research on simplifying Transformer blocks and on understanding and minimizing outlier features in Transformer training, targeting researchers and practitioners in deep learning. It offers code to reproduce experiments from two papers (ICLR 2024 and NeurIPS 2024), enabling exploration of novel Transformer architectures and training techniques.

How It Works

The codebase focuses on autoregressive language modeling, specifically next-token prediction using GPT-2. It implements simplified Transformer block variants, including parallel and "skipless" configurations, which deviate from standard attention and feed-forward sub-layer structures. These modifications aim to reduce computational complexity and improve training stability by addressing outlier features, a key contribution of the associated research.
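
As a rough illustration of the block variants involved (a minimal sketch using standard PyTorch modules with assumed dimensions and layer choices, not the repository's actual implementation): a "parallel" block feeds the same normalized input to both the attention and MLP branches and sums them onto one residual stream, while a "skipless" variant removes the identity path.

    import torch
    import torch.nn as nn

    class ParallelBlock(nn.Module):
        """Parallel pre-norm block: attention and MLP branches read the same
        normalized input, and their outputs are added to a single residual stream."""
        def __init__(self, dim: int, n_heads: int, mlp_ratio: int = 4):
            super().__init__()
            self.norm = nn.LayerNorm(dim)
            self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
            self.mlp = nn.Sequential(
                nn.Linear(dim, mlp_ratio * dim),
                nn.GELU(),
                nn.Linear(mlp_ratio * dim, dim),
            )

        def _branches(self, x):
            # Causal mask for autoregressive next-token prediction.
            T = x.size(1)
            causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), 1)
            h = self.norm(x)
            attn_out, _ = self.attn(h, h, h, attn_mask=causal, need_weights=False)
            return attn_out, self.mlp(h)

        def forward(self, x):
            attn_out, mlp_out = self._branches(x)
            return x + attn_out + mlp_out   # skip + attention branch + MLP branch

    class SkiplessParallelBlock(ParallelBlock):
        """Same branches, but the identity (skip) path is removed."""
        def forward(self, x):
            attn_out, mlp_out = self._branches(x)
            return attn_out + mlp_out       # no residual skip connection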

Quick Start & Requirements

  • Install dependencies: pip install -r requirements.txt
  • Primary script: python run_clm.py (see the example invocation after this list)
  • Prerequisites: hydra, wandb, torch, transformers, datasets, evaluate, accelerate.
  • Hardware: Single GPU recommended. A100/A5000 for ~10-hour training runs.
  • Data: Downloads automatically on first run.
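
A typical run combines the entry-point script with Hydra overrides given directly on the command line. A minimal example using the use_wandb flag mentioned under Highlighted Details (any other config keys would depend on the repository's Hydra config files):

    python run_clm.py use_wandb=False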

Highlighted Details

  • Reproduces experiments from "Simplifying Transformer Blocks" (ICLR 2024) and "Understanding and Minimising Outlier Features in Transformer Training" (NeurIPS 2024).
  • Implements custom Transformer block variants: default-parallel, skipless, and skipless-parallel.
  • Configurable via Hydra for easy argument modification from the command line.
  • Uses wandb for logging by default; can be disabled with use_wandb=False.

Maintenance & Community

The project is associated with authors Bobby He and Thomas Hofmann. Further community engagement channels are not specified in the README.

Licensing & Compatibility

The README does not explicitly state a license. Compatibility for commercial or closed-source use is not specified.

Limitations & Caveats

The codebase is primarily intended for reproducing the research experiments and may require adaptation for general-purpose use. The kurtosis computation for outlier features uses the variance of squared activations, which differs from standard kurtosis by an additive constant of 1, though this does not affect the findings.
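
For reference, if kurtosis is taken as the fourth-moment ratio E[x⁴] / (E[x²])², the additive offset follows from the standard variance expansion Var(x²) / (E[x²])² = E[x⁴] / (E[x²])² − 1 (the exact normalization used in the code is an assumption here).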

Health Check

  • Last commit: 9 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 1 star in the last 30 days

Explore Similar Projects

Starred by Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral), Yaowei Zheng (Author of LLaMA-Factory), and 4 more.

ml-cross-entropy by apple · 0.4% · 520 stars
PyTorch module for memory-efficient cross-entropy in LLMs
Created 10 months ago · Updated 1 day ago
Starred by Benjamin Bolte (Cofounder of K-Scale Labs), Albert Gu (Cofounder of Cartesia; Professor at CMU), and 2 more.

Muon by KellerJordan · 1.7% · 2k stars
Optimizer for neural network hidden layers
Created 10 months ago · Updated 2 months ago
Starred by Théophile Gervet (Cofounder of Genesis AI), Jason Knight (Director AI Compilers at NVIDIA; Cofounder of OctoML), and 6 more.

lingua by facebookresearch · 0.1% · 5k stars
LLM research codebase for training and inference
Created 11 months ago · Updated 2 months ago