protein_bert by nadavbra

Protein language model for protein-related tasks

created 4 years ago
542 stars

Top 59.5% on sourcepulse

Project Summary

ProteinBERT is a deep learning model for protein sequence analysis, offering state-of-the-art performance on various benchmarks. It's designed for researchers and developers working with protein data, enabling rapid training of protein predictors and feature extraction for downstream tasks.

How It Works

ProteinBERT is inspired by BERT but incorporates innovations like global-attention layers with linear complexity, allowing it to process extremely long protein sequences efficiently. It uses a self-supervised pretraining scheme combining language modeling with Gene Ontology (GO) annotation prediction. The model can accept protein sequences and optional GO annotations as input, producing both local and global representations.
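
The sketch below shows this feature-extraction workflow as recalled from the repository's README and demo; the names `load_pretrained_model`, `get_model_with_hidden_layers_as_outputs`, and `encode_X`, and the chosen sequence length, are assumptions to verify against the current package.

```python
# Minimal sketch: extracting local and global representations with the
# pretrained ProteinBERT model. API names are assumed from the README/demo.
from proteinbert import load_pretrained_model
from proteinbert.conv_and_global_attention_model import get_model_with_hidden_layers_as_outputs

seq_len = 512  # example input length (assumption); the architecture is length-flexible
seqs = ['MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ']  # placeholder protein sequence

# Load the pretrained model generator and its input encoder.
pretrained_model_generator, input_encoder = load_pretrained_model()
model = get_model_with_hidden_layers_as_outputs(pretrained_model_generator.create_model(seq_len))

# Encode the sequences and run a forward pass.
encoded_x = input_encoder.encode_X(seqs, seq_len)
local_representations, global_representations = model.predict(encoded_x, batch_size=32)
# local_representations:  one feature vector per sequence position
# global_representations: one fixed-size feature vector per sequence
```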

Quick Start & Requirements

  • Install via python setup.py install after cloning the repository and initializing submodules.
  • Requires Python 3, TensorFlow (2.4.0), TensorFlow-Addons (0.12.1), NumPy (1.20.1), Pandas (1.2.3), h5py (3.2.1), lxml (4.3.2), and pyfaidx (0.5.8).
  • Pretrained models are available on Hugging Face and Zenodo.
  • Full pretraining requires >1TB storage and can take weeks.
  • Demo notebook for fine-tuning: ProteinBERT demo (a hedged fine-tuning sketch follows this list).
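
For fine-tuning on a labeled dataset, the demo notebook builds a fine-tuning model generator on top of the pretrained model. The sketch below follows that workflow as recalled; the names `OutputType`, `OutputSpec`, `FinetuningModelGenerator`, and `finetune`, the keyword arguments, and the hyperparameter values are assumptions to confirm against the demo notebook.

```python
# Hedged sketch: fine-tuning ProteinBERT for one global binary label per sequence.
# Names and keyword arguments are assumed from the demo notebook; verify before use.
from proteinbert import (OutputType, OutputSpec, FinetuningModelGenerator,
                         load_pretrained_model, finetune)
from proteinbert.conv_and_global_attention_model import get_model_with_hidden_layers_as_outputs

# Placeholder data: replace with real sequences and 0/1 labels.
train_seqs, train_labels = ['MKTAYIAKQR', 'MSDQEAKPST'], [1, 0]
valid_seqs, valid_labels = ['MNIFEMLRID'], [1]

output_spec = OutputSpec(OutputType(False, 'binary'), [0, 1])  # one binary label per sequence

pretrained_model_generator, input_encoder = load_pretrained_model()
model_generator = FinetuningModelGenerator(
    pretrained_model_generator, output_spec,
    pretrained_model_manipulation_function=get_model_with_hidden_layers_as_outputs,
    dropout_rate=0.5)

# Two-stage schedule (frozen pretrained layers first, then full fine-tuning);
# hyperparameter values are illustrative assumptions.
finetune(model_generator, input_encoder, output_spec,
         train_seqs, train_labels, valid_seqs, valid_labels,
         seq_len=512, batch_size=32, max_epochs_per_stage=5, lr=1e-4,
         begin_with_frozen_pretrained_layers=True, lr_with_frozen_pretrained_layers=1e-2,
         n_final_epochs=1, final_seq_len=1024, final_lr=1e-5)
```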

Highlighted Details

  • Processes protein sequences of virtually any length due to linear-complexity global attention.
  • Achieves state-of-the-art performance on diverse protein property benchmarks.
  • Can utilize GO annotations as auxiliary input for improved functional inference.
  • Offers both local and global sequence representations for feature extraction.

Maintenance & Community

  • The primary repository is nadavbra/protein_bert.
  • An unofficial PyTorch implementation is available at lucidrains/protein-bert-pytorch.
  • Citation: Brandes et al., Bioinformatics (2022).

Licensing & Compatibility

  • Licensed under the MIT License, permitting commercial use and closed-source linking.

Limitations & Caveats

  • Pretraining from scratch is computationally intensive, requiring significant storage (>1TB) and time (weeks).
  • Dependencies must be managed manually if the setup.py script is not used for installation.

Health Check

  • Last commit: 4 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 1

Star History

  • 16 stars in the last 90 days
