ProteinWorkshop by a-r-j

Benchmarking framework for protein representation learning

Created 2 years ago
253 stars

Project Summary

This repository provides a comprehensive benchmarking framework for protein representation learning, targeting researchers and practitioners in structural bioinformatics and machine learning. It offers a unified platform for evaluating various featurization schemes, datasets, and models, enabling reproducible research and facilitating the development of new protein representation learning methods.

How It Works

ProteinWorkshop employs a modular design, allowing users to combine different components for pre-training and downstream tasks. It supports both invariant and equivariant graph neural networks, along with a rich set of featurization schemes that capture varying levels of structural detail. The framework automates data downloading and processing, and its configuration-driven approach, powered by Hydra and Weights & Biases, simplifies experiment management and hyperparameter tuning.
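The configuration-driven, modular idea can be illustrated with a minimal sketch. It does not use ProteinWorkshop's or Hydra's actual API: the registries, the component names (residue_index, mean_pool, and so on), and build_pipeline are all hypothetical, and the "encoders" are toy pooling functions standing in for real neural networks. The point is only that a small config selects which interchangeable components get wired together.

```python
from typing import Callable, Dict, List

# Hypothetical registries; names are illustrative, not ProteinWorkshop's API.
FEATURISERS: Dict[str, Callable[[str], List[float]]] = {}
ENCODERS: Dict[str, Callable[[List[float]], float]] = {}

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def register(registry, name):
    """Decorator that stores a component in a registry under a config name."""
    def decorator(fn):
        registry[name] = fn
        return fn
    return decorator


@register(FEATURISERS, "residue_index")
def residue_index(sequence):
    # One scalar per residue: its position in the amino-acid alphabet.
    return [float(AMINO_ACIDS.index(aa)) for aa in sequence]


@register(FEATURISERS, "residue_fraction")
def residue_fraction(sequence):
    # Same information, rescaled to the range [0, 1].
    return [AMINO_ACIDS.index(aa) / (len(AMINO_ACIDS) - 1) for aa in sequence]


@register(ENCODERS, "mean_pool")
def mean_pool(features):
    return sum(features) / len(features)


@register(ENCODERS, "max_pool")
def max_pool(features):
    return max(features)


def build_pipeline(config):
    """Look up the components named in the config and compose them."""
    featurise = FEATURISERS[config["features"]]
    encode = ENCODERS[config["encoder"]]
    return lambda sequence: encode(featurise(sequence))


# Swapping a component only requires changing a config value.
config = {"features": "residue_index", "encoder": "max_pool"}
pipeline = build_pipeline(config)
print(pipeline("ACDY"))
```

In a Hydra-based project the dictionary above would typically come from composed YAML files rather than being written inline, which is what makes sweeps over component choices convenient.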

Quick Start & Requirements

  • Installation: run pip install proteinworkshop for library usage, or clone the repository and run pip install -e . for development.
  • Prerequisites: PyTorch (>= 2.1.2) with CUDA support is required. PyTorch Geometric is installed via workshop install pyg. Linux-like systems with NVIDIA CUDA are officially supported; Windows and macOS are not.
  • Data: Datasets are downloaded automatically on first use or via workshop download <DATASET_NAME>. The PDB dataset requires ~23 GB.
  • Documentation: Tutorials and Quickstart guides are available.
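The prerequisites above can be checked before installing with a small preflight script. This is a sketch under stated assumptions, not part of the workshop CLI: the function names parse_version and preflight are hypothetical, the thresholds are taken from the list above (PyTorch >= 2.1.2, roughly 23 GB for the PDB dataset), and the disk check looks at whatever directory is passed in rather than the location the framework actually uses for data.

```python
import importlib.util
import platform
import re
import shutil

MIN_TORCH = (2, 1, 2)   # minimum PyTorch version listed in the prerequisites
PDB_DATASET_GB = 23     # approximate size of the PDB dataset


def parse_version(text):
    """Turn a string such as '2.1.2+cu121' into (2, 1, 2), ignoring suffixes."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"unrecognised version string: {text!r}")
    return tuple(int(part or 0) for part in match.groups())


def preflight(data_dir="."):
    """Return a list of human-readable problems; an empty list means all clear."""
    problems = []

    if platform.system() != "Linux":
        problems.append(f"{platform.system()} is not an officially supported OS")

    if importlib.util.find_spec("torch") is None:
        problems.append("PyTorch is not installed")
    else:
        import torch  # imported lazily so the script runs without PyTorch
        if parse_version(torch.__version__) < MIN_TORCH:
            problems.append(f"PyTorch {torch.__version__} is older than 2.1.2")
        if not torch.cuda.is_available():
            problems.append("CUDA is not available to PyTorch")

    free_gb = shutil.disk_usage(data_dir).free / 1e9
    if free_gb < PDB_DATASET_GB:
        problems.append(
            f"only {free_gb:.1f} GB free in {data_dir!r}; "
            f"the PDB dataset needs about {PDB_DATASET_GB} GB"
        )
    return problems


if __name__ == "__main__":
    for problem in preflight():
        print("WARNING:", problem)
```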

Highlighted Details

  • Includes a wide array of pre-training corpora (e.g., AlphaFold DB, CATH, PDB) and supervised datasets for various tasks (e.g., binding site prediction, protein-protein interaction (PPI) prediction).
  • Supports diverse featurization schemes, from simple residue types to complex dihedral angles and equivariant edge vectors.
  • Features implementations of state-of-the-art invariant and equivariant graph neural networks like DimeNet++, EGNN, and Tensor Field Networks.
  • Provides utilities for embedding generation, visualization, and model attribution (e.g., using integrated gradients).
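To make the featurization bullet concrete, the sketch below computes two of the feature types it mentions: a dihedral angle (an invariant scalar, encoded as a sine/cosine pair) and a unit edge vector (an equivariant quantity). It is an illustration in plain Python, not the framework's implementation; all function names are hypothetical, the coordinates are made up, and the sign of the dihedral follows one common atan2-based convention.

```python
import math


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def scale(a, s):
    return tuple(x * s for x in a)


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle (radians) about the p1-p2 axis."""
    b0 = sub(p0, p1)
    b1 = sub(p2, p1)
    b2 = sub(p3, p2)
    b1 = scale(b1, 1.0 / math.sqrt(dot(b1, b1)))  # unit vector along the axis
    v = sub(b0, scale(b1, dot(b0, b1)))           # b0 projected off the axis
    w = sub(b2, scale(b1, dot(b2, b1)))           # b2 projected off the axis
    return math.atan2(dot(cross(b1, v), w), dot(v, w))


def angle_features(angle):
    """Encode an angle as (sin, cos) so that -pi and pi map to the same point."""
    return (math.sin(angle), math.cos(angle))


def edge_unit_vector(src, dst):
    """Direction from src to dst; it rotates together with the structure."""
    d = sub(dst, src)
    return scale(d, 1.0 / math.sqrt(dot(d, d)))


def rotate_z(p, theta):
    """Rotate a point about the z-axis, used here to probe symmetry behaviour."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * p[0] - s * p[1], s * p[0] + c * p[1], p[2])


# Four made-up points standing in for consecutive backbone atoms.
points = [(1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 1.0, 1.0)]
rotated = [rotate_z(p, 0.7) for p in points]

print(dihedral(*points), dihedral(*rotated))      # invariant: same angle twice
print(edge_unit_vector(points[0], points[3]))     # equivariant: changes...
print(edge_unit_vector(rotated[0], rotated[3]))   # ...exactly as the rotation does
```

Rotating the whole structure leaves the dihedral unchanged, while the edge vector turns with it. That distinction is what separates features suited to invariant models from those consumed by equivariant ones.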

Maintenance & Community

The project is associated with the ICLR 2024 paper "Evaluating Representation Learning on the Protein Structure Universe." Community interaction channels are not explicitly mentioned in the README.

Licensing & Compatibility

Licenses vary by dataset, including GPL-3.0, CC-BY 4.0, MIT, Apache 2.0, and CC0 1.0. The GPL-3.0 license for some datasets may impose copyleft restrictions on derivative works. Commercial use compatibility depends on the specific dataset licenses.

Limitations & Caveats

Only Linux-like systems with NVIDIA CUDA are officially supported, so users on Windows or macOS may run into installation or runtime problems. Some datasets have large storage requirements (up to 1 TB).

Health Check

  • Last Commit: 4 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 3 stars in the last 30 days
