protein_bert by nadavbra

Protein language model for protein-related tasks

created 4 years ago
542 stars

Top 59.5% on sourcepulse

Project Summary

ProteinBERT is a deep learning model for protein sequence analysis, offering state-of-the-art performance on various benchmarks. It's designed for researchers and developers working with protein data, enabling rapid training of protein predictors and feature extraction for downstream tasks.

How It Works

ProteinBERT is inspired by BERT but incorporates innovations like global-attention layers with linear complexity, allowing it to process extremely long protein sequences efficiently. It uses a self-supervised pretraining scheme combining language modeling with Gene Ontology (GO) annotation prediction. The model can accept protein sequences and optional GO annotations as input, producing both local and global representations.
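
The sketch below shows this feature-extraction workflow as recalled from the repository's README and demo; the names `load_pretrained_model`, `get_model_with_hidden_layers_as_outputs`, and `encode_X`, and the chosen sequence length, are assumptions to verify against the current package.

```python
# Minimal sketch: extracting local and global representations with the
# pretrained ProteinBERT model. API names are assumed from the README/demo.
from proteinbert import load_pretrained_model
from proteinbert.conv_and_global_attention_model import get_model_with_hidden_layers_as_outputs

seq_len = 512  # example input length (assumption); the architecture is length-flexible
seqs = ['MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ']  # placeholder protein sequence

# Load the pretrained model generator and its input encoder.
pretrained_model_generator, input_encoder = load_pretrained_model()
model = get_model_with_hidden_layers_as_outputs(pretrained_model_generator.create_model(seq_len))

# Encode the sequences and run a forward pass.
encoded_x = input_encoder.encode_X(seqs, seq_len)
local_representations, global_representations = model.predict(encoded_x, batch_size=32)
# local_representations:  one feature vector per sequence position
# global_representations: one fixed-size feature vector per sequence
```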

Quick Start & Requirements

  • Install via python setup.py install after cloning the repository and initializing submodules.
  • Requires Python 3, TensorFlow (2.4.0), TensorFlow-Addons (0.12.1), NumPy (1.20.1), Pandas (1.2.3), h5py (3.2.1), lxml (4.3.2), and pyfaidx (0.5.8).
  • Pretrained models are available on Hugging Face and Zenodo.
  • Full pretraining requires >1TB storage and can take weeks.
  • Demo notebook for fine-tuning: ProteinBERT demo (a hedged fine-tuning sketch follows this list).
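
For fine-tuning on a labeled dataset, the demo notebook builds a fine-tuning model generator on top of the pretrained model. The sketch below follows that workflow as recalled; the names `OutputType`, `OutputSpec`, `FinetuningModelGenerator`, and `finetune`, the keyword arguments, and the hyperparameter values are assumptions to confirm against the demo notebook.

```python
# Hedged sketch: fine-tuning ProteinBERT for one global binary label per sequence.
# Names and keyword arguments are assumed from the demo notebook; verify before use.
from proteinbert import (OutputType, OutputSpec, FinetuningModelGenerator,
                         load_pretrained_model, finetune)
from proteinbert.conv_and_global_attention_model import get_model_with_hidden_layers_as_outputs

# Placeholder data: replace with real sequences and 0/1 labels.
train_seqs, train_labels = ['MKTAYIAKQR', 'MSDQEAKPST'], [1, 0]
valid_seqs, valid_labels = ['MNIFEMLRID'], [1]

output_spec = OutputSpec(OutputType(False, 'binary'), [0, 1])  # one binary label per sequence

pretrained_model_generator, input_encoder = load_pretrained_model()
model_generator = FinetuningModelGenerator(
    pretrained_model_generator, output_spec,
    pretrained_model_manipulation_function=get_model_with_hidden_layers_as_outputs,
    dropout_rate=0.5)

# Two-stage schedule (frozen pretrained layers first, then full fine-tuning);
# hyperparameter values are illustrative assumptions.
finetune(model_generator, input_encoder, output_spec,
         train_seqs, train_labels, valid_seqs, valid_labels,
         seq_len=512, batch_size=32, max_epochs_per_stage=5, lr=1e-4,
         begin_with_frozen_pretrained_layers=True, lr_with_frozen_pretrained_layers=1e-2,
         n_final_epochs=1, final_seq_len=1024, final_lr=1e-5)
```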

Highlighted Details

  • Processes protein sequences of virtually any length due to linear-complexity global attention.
  • Achieves state-of-the-art performance on diverse protein property benchmarks.
  • Can utilize GO annotations as auxiliary input for improved functional inference.
  • Offers both local and global sequence representations for feature extraction.

Maintenance & Community

  • The primary repository is nadavbra/protein_bert.
  • An unofficial PyTorch implementation is available at lucidrains/protein-bert-pytorch.
  • Citation: Brandes et al., Bioinformatics (2022).

Licensing & Compatibility

  • Licensed under the MIT License, permitting commercial use and closed-source linking.

Limitations & Caveats

  • Pretraining from scratch is computationally intensive, requiring significant storage (>1TB) and time (weeks).
  • Dependencies must be managed manually if the setup.py script is not used for installation.

Health Check

  • Last commit: 4 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 1

Star History

  • 16 stars in the last 90 days
