ProteinWorkshop by a-r-j

Benchmarking framework for protein representation learning

Created 2 years ago

264 stars

Top 96.8% on SourcePulse

Project Summary

This repository provides a comprehensive benchmarking framework for protein representation learning, targeting researchers and practitioners in structural bioinformatics and machine learning. It offers a unified platform for evaluating various featurization schemes, datasets, and models, enabling reproducible research and facilitating the development of new protein representation learning methods.

How It Works

ProteinWorkshop employs a modular design, allowing users to combine different components for pre-training and downstream tasks. It supports both invariant and equivariant graph neural networks, along with a rich set of featurization schemes that capture varying levels of structural detail. The framework automates data downloading and processing, and its configuration-driven approach, powered by Hydra and Weights & Biases, simplifies experiment management and hyperparameter tuning.
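The configuration-driven composition described above can be illustrated with a small, library-free sketch. It does not use Hydra and is not ProteinWorkshop's code: the config keys (`encoder`, `features`, `trainer`) and the component names (`gnn_a`, `gnn_b`) are hypothetical placeholders, and `apply_overrides` is an assumed helper that only mimics the general idea of dotted `key=value` overrides swapping components without touching the base configuration.

```python
import copy


def parse_value(raw):
    """Convert an override string to int, float, or bool when possible."""
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            pass
    if raw.lower() in ("true", "false"):
        return raw.lower() == "true"
    return raw


def apply_overrides(config, overrides):
    """Return a copy of `config` with dotted `key=value` overrides applied."""
    result = copy.deepcopy(config)
    for item in overrides:
        dotted_key, _, raw_value = item.partition("=")
        *parents, leaf = dotted_key.split(".")
        node = result
        for name in parents:
            # Create intermediate sections on demand.
            node = node.setdefault(name, {})
        node[leaf] = parse_value(raw_value)
    return result


# Hypothetical base configuration: one entry per swappable component.
base_config = {
    "encoder": "gnn_a",
    "features": "residue_type",
    "trainer": {"max_epochs": 100, "lr": 1e-3},
}

# Swap the encoder and shorten training without editing the base config.
run_config = apply_overrides(
    base_config, ["encoder=gnn_b", "trainer.max_epochs=10"]
)
print(run_config)
```

Keeping the base configuration immutable and deriving each run from it is what makes experiments easy to reproduce: a run is fully described by the base config plus its list of overrides.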

Quick Start & Requirements

Installation: pip install proteinworkshop (for library usage), or clone the repository and run pip install -e . (for development).
Prerequisites: PyTorch (>= 2.1.2) with CUDA support is required. PyTorch Geometric is installed via workshop install pyg. Linux-like systems with NVIDIA CUDA are officially supported; Windows and macOS are not.
Data: Datasets are downloaded automatically on first use or via workshop download <DATASET_NAME>. The PDB dataset requires ~23 GB.
Documentation: Tutorials and Quickstart guides are available.
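Before installing, it can help to check the prerequisites listed above. The following is a minimal sketch, not part of ProteinWorkshop: the helper names (`parse_version`, `meets_minimum`, `has_free_space`, `report`) are made up for illustration, and the ~23 GB figure is treated as 23 × 10^9 bytes, which is an assumption about how the size is measured. The only PyTorch calls used are `torch.__version__` and `torch.cuda.is_available()`.

```python
import re
import shutil

MIN_TORCH = (2, 1, 2)   # minimum PyTorch version stated above
PDB_DATASET_GB = 23     # approximate size of the PDB dataset


def parse_version(version):
    """Parse a version string such as '2.1.2+cu121' into (2, 1, 2)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return tuple(int(part) if part else 0 for part in match.groups())


def meets_minimum(version, minimum=MIN_TORCH):
    """Return True if `version` is at least `minimum`."""
    return parse_version(version) >= minimum


def has_free_space(path, required_gb):
    """Return True if the filesystem holding `path` has enough free space."""
    return shutil.disk_usage(path).free >= required_gb * 10**9


def report(data_dir="."):
    """Print a short prerequisite report for the current environment."""
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed.")
    else:
        version = str(torch.__version__)
        print(f"PyTorch {version}: minimum met = {meets_minimum(version)}")
        print(f"CUDA available: {torch.cuda.is_available()}")
    enough = has_free_space(data_dir, PDB_DATASET_GB)
    print(f"Room for the PDB dataset in {data_dir!r}: {enough}")


if __name__ == "__main__":
    report()
```

Comparing versions as integer tuples avoids the string-comparison pitfall where "2.10.0" would sort before "2.2.0".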

Highlighted Details

Includes a wide array of pre-training corpora (e.g., AlphaFold DB, CATH, PDB) and supervised datasets for various tasks (e.g., binding site prediction, protein-protein interaction (PPI)).
Supports diverse featurization schemes, from simple residue types to complex dihedral angles and equivariant edge vectors.
Features implementations of state-of-the-art invariant and equivariant graph neural networks like DimeNet++, EGNN, and Tensor Field Networks.
Provides utilities for embedding generation, visualization, and model attribution (e.g., using integrated gradients).
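The distinction between invariant and equivariant features mentioned above can be made concrete with a small, dependency-free sketch. It is not ProteinWorkshop's featurization code; the four points and all helper names are invented for illustration. A dihedral angle computed from four points stays the same when the structure is rotated and translated (invariant), while an edge vector between two points rotates along with the structure (equivariant).

```python
import math


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle (radians) defined by four points."""
    b0, b1, b2 = sub(p1, p0), sub(p2, p1), sub(p3, p2)
    n1, n2 = cross(b0, b1), cross(b1, b2)   # normals of the two planes
    length = math.sqrt(dot(b1, b1))
    b1_hat = tuple(x / length for x in b1)
    return math.atan2(dot(b1_hat, cross(n1, n2)), dot(n1, n2))


def rotate_z(p, theta):
    """Rotate a point or vector about the z-axis by `theta` radians."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * p[0] - s * p[1], s * p[0] + c * p[1], p[2])


def transform(p, theta, shift):
    """Rigid motion: rotate about z, then translate by `shift`."""
    return tuple(x + t for x, t in zip(rotate_z(p, theta), shift))


# Four made-up points standing in for consecutive backbone atoms.
points = [(1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 1.0, 1.0)]
theta, shift = 0.7, (3.0, -2.0, 5.0)
moved = [transform(p, theta, shift) for p in points]

angle_before = dihedral(*points)
angle_after = dihedral(*moved)

edge_before = sub(points[3], points[2])
edge_after = sub(moved[3], moved[2])

print(angle_before, angle_after)                 # equal up to rounding
print(edge_after, rotate_z(edge_before, theta))  # equal up to rounding
```

Invariant models consume only quantities like the angle, whereas equivariant models can also consume the edge vectors, because their layers are built so that outputs transform consistently with the inputs.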

Maintenance & Community

The project is associated with the ICLR 2024 paper "Evaluating Representation Learning on the Protein Structure Universe." Community interaction channels are not explicitly mentioned in the README.

Licensing & Compatibility

Licenses vary by dataset and include GPL-3.0, CC-BY 4.0, MIT, Apache 2.0, and CC0 1.0. The GPL-3.0 license attached to some datasets may impose copyleft restrictions on derivative works. Whether commercial use is permitted depends on the license of each dataset involved.

Limitations & Caveats

Only Linux-like systems with NVIDIA CUDA are officially supported; Windows and macOS are not, which may hinder adoption on those platforms. Some datasets have large storage requirements (up to 1 TB).

Health Check

Last Commit: 8 months ago
Responsiveness: Inactive

Star History: 7 stars in the last 30 days