minAlphaFold2 by ChrisHayduk

Minimal PyTorch AlphaFold2 reimplementation for learning and research

Created 2 months ago
584 stars

Top 55.4% on SourcePulse

View on GitHub
1 Expert Loves This Project
Project Summary

A minimal, pedagogical PyTorch reimplementation of AlphaFold2, designed for understanding and modification. It targets AI researchers and engineers seeking to demystify AlphaFold2's architecture and training, thereby accelerating progress in AI x biology by making the system accessible.

How It Works

The project offers a direct, 1-to-1 reimplementation of AlphaFold2's model and training pipeline using only core PyTorch primitives. Each module maps precisely to a numbered algorithm in the official supplement, prioritizing readability over production optimizations. This approach enables end-to-end training on a single GPU through gradient accumulation and checkpointing, making the complex system comprehensible within an afternoon.
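The single-GPU recipe described above can be sketched in a few lines of PyTorch. This is a hedged illustration, not the repo's actual trainer: the model, micro-batch size, and loss below are toy stand-ins, while the two memory tricks (gradient accumulation and gradient checkpointing) are the real techniques named in the summary.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# Toy stand-in for the real Evoformer/structure-module stack (assumption).
model = nn.Sequential(*[nn.Linear(32, 32) for _ in range(4)])
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

accum_steps = 4  # effective batch = accum_steps * micro-batch size


def forward_checkpointed(x):
    # Gradient checkpointing: recompute each layer's activations during
    # backward instead of storing them, trading compute for memory.
    for layer in model:
        x = checkpoint(layer, x, use_reentrant=False)
    return x


opt.zero_grad()
for _ in range(accum_steps):
    x = torch.randn(8, 32)                 # one micro-batch
    loss = forward_checkpointed(x).pow(2).mean()  # placeholder loss
    (loss / accum_steps).backward()        # scale so grads average over micro-batches
opt.step()                                 # one optimizer step per accumulated batch
```

Together, accumulation keeps the effective batch size at paper scale while only one micro-batch's activations (and, with checkpointing, only a fraction of those) live in GPU memory at a time.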

Quick Start & Requirements

  • Installation: pip install -e '.[dev]'
  • Prerequisites: PyTorch, NumPy, pytest. Optional: OpenMM, pdbfixer for structure relaxation. Modal integration requires pip install -e '.[modal]'.
  • Hardware: Trainable on a single GPU; Modal runners support H200/A100 GPUs.
  • Setup: A 5-minute sanity check involves overfitting a single PDB on CPU. Full training requires significant time (~7 days on TPUv3 for Stage 1) and ~100 GB for data uploads to Modal.
  • Documentation: Primary reference is the AlphaFold2 supplement.

Highlighted Details

  • Pure PyTorch: Utilizes only nn.Linear, nn.LayerNorm, torch.einsum, and standard activations.
  • Supplement Mapping: Every file and module directly corresponds to a numbered algorithm in the AlphaFold2 supplement.
  • Paper-Spec Training: Implements the two-stage training recipe, including LR schedules, parameter EMA, and violation loss.
  • Single GPU Trainability: Achieved via gradient accumulation and gradient checkpointing.
  • Unit Consistency: Structure module operates internally in nanometers, with explicit conversions at boundaries.
  • Zero-Initialization: Implements zero-init for output projections and gate biases per §1.11.4.
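The zero-initialization scheme from the last bullet can be sketched as follows. This is a hedged reading of §1.11.4 of the AlphaFold2 supplement, not code copied from this repo: output projections start at zero so each residual block is the identity at initialization, and gating linears start with zero weights and unit bias so their sigmoid gates open near one.

```python
import torch
import torch.nn as nn


def zero_init_output(linear: nn.Linear) -> nn.Linear:
    # Output projection: zero weights (and bias), so the block
    # contributes nothing to the residual stream at initialization.
    nn.init.zeros_(linear.weight)
    if linear.bias is not None:
        nn.init.zeros_(linear.bias)
    return linear


def gate_init(linear: nn.Linear) -> nn.Linear:
    # Gating linear: zero weights, bias 1, so sigmoid(1) ~= 0.73 and the
    # gate starts mostly open regardless of the input.
    nn.init.zeros_(linear.weight)
    nn.init.ones_(linear.bias)
    return linear


out_proj = zero_init_output(nn.Linear(64, 64))
gate = gate_init(nn.Linear(64, 64))

x = torch.randn(2, 64)
y = out_proj(x)                    # all zeros at init
g = torch.sigmoid(gate(x))         # all ~0.73 at init
```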

Maintenance & Community

No specific details on active maintenance, community channels (like Discord/Slack), or a public roadmap are provided in the README.

Licensing & Compatibility

The project is licensed under MIT. Data derived from the AlphaFold2 source code is under Apache 2.0. The MIT license is permissive, allowing commercial use and incorporation into closed-source projects.

Limitations & Caveats

This is a pedagogical reimplementation, not an inference harness or a speed benchmark. It supports monomers only; multimer and AF3 architectures are excluded, as are self-distillation dataset generation, custom MSA generation, and the paper's MMseqs2 clustering. Structure relaxation has caveats for highly violating inputs, so training with the violation loss active is recommended for cleaner outputs. The trainer lacks distributed data parallelism and relies on per-micro-batch gradient clipping for larger micro-batches.

Health Check

  • Last Commit: 1 day ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 137 stars in the last 30 days

Explore Similar Projects

Starred by Victor Taelin (Author of Bend, Kind, HVM), Sebastian Raschka (Author of "Build a Large Language Model (From Scratch)"), and 2 more.

nanoT5 by PiotrNawrot

0%
1k
PyTorch code for T5 pre-training and fine-tuning on a single GPU
Created 3 years ago
Updated 1 year ago
Starred by Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), Lewis Tunstall (Research Engineer at Hugging Face), and 15 more.

torchtune by meta-pytorch

0.1%
6k
PyTorch library for LLM post-training and experimentation
Created 2 years ago
Updated 1 day ago
Starred by Peter Norvig (Author of "Artificial Intelligence: A Modern Approach"; Research Director at Google), Alexey Milovidov (Cofounder of ClickHouse), and 29 more.

llm.c by karpathy

0.2%
30k
LLM training in pure C/CUDA, no PyTorch needed
Created 2 years ago
Updated 10 months ago