Research paper implementation for simplifying transformer blocks
This repository provides implementations for research on simplifying Transformer blocks and on understanding and minimizing outlier features in Transformer training, targeting researchers and practitioners in deep learning. It offers code to reproduce the experiments from two papers (published at ICLR 2024 and NeurIPS 2024), enabling exploration of novel Transformer architectures and training techniques.
How It Works
The codebase focuses on autoregressive language modeling, specifically next-token prediction using GPT-2. It implements simplified Transformer block variants, including parallel and "skipless" configurations, which deviate from standard attention and feed-forward sub-layer structures. These modifications aim to reduce computational complexity and improve training stability by addressing outlier features, a key contribution of the associated research.
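As a rough sketch of the structural idea (this is not the repository's actual implementation; the class name, layer choices, and shapes below are assumptions), a parallel, skipless block can feed the same normalized input to both attention and the MLP and sum their outputs without an identity residual branch:

```python
# Illustrative sketch only; not the repository's code.
import torch
import torch.nn as nn

class ParallelSkiplessBlock(nn.Module):
    """Hypothetical parallel, skipless Transformer block."""

    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x: torch.Tensor, attn_mask=None) -> torch.Tensor:
        # "Parallel": attention and MLP both read the same normalized input.
        h = self.norm(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        # "Skipless": sub-layer outputs are combined without the usual x + ... residual.
        return attn_out + self.mlp(h)
```

The sketch only shows the structural departure from a standard pre-norm block; the actual variants in the repository may differ in normalization placement, scaling, and other details omitted here.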
Quick Start & Requirements
pip install -r requirements.txt
python run_clm.py
Key dependencies: hydra, wandb, torch, transformers, datasets, evaluate, and accelerate.
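Since hydra is used for configuration, run options can presumably be overridden on the command line; for example, a run with logging disabled might look like the following (the override name use_wandb=False comes from the notes below, but the exact invocation is an assumption):
python run_clm.py use_wandb=False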
Highlighted Details
The codebase uses wandb for logging by default; this can be disabled with use_wandb=False.
Maintenance & Community
The project is associated with authors Bobby He and Thomas Hofmann. Further community engagement channels are not specified in the README.
Licensing & Compatibility
The README does not explicitly state a license. Compatibility for commercial or closed-source use is not specified.
Limitations & Caveats
The codebase is primarily intended for reproducing the papers' experiments and may require adaptation for general-purpose use. The kurtosis computation for outlier features uses the variance of squared activations, which differs from standard kurtosis by an additive constant of 1; this does not affect the reported findings.
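As a minimal sketch of that distinction (assuming standardized activations in a 1-D tensor; this is not the repository's code):

```python
import torch

def variance_of_squares(x: torch.Tensor) -> torch.Tensor:
    # Metric as described in the caveat: variance of the squared, standardized activations.
    z = (x - x.mean()) / x.std(unbiased=False)
    return (z ** 2).var(unbiased=False)  # equals standard kurtosis minus 1

def standard_kurtosis(x: torch.Tensor) -> torch.Tensor:
    # Standard kurtosis: mean fourth power of the standardized activations.
    z = (x - x.mean()) / x.std(unbiased=False)
    return (z ** 4).mean()
```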