PyTorch modules for modeling biological sequence data
This repository provides PyTorch modules and utilities for modeling biological sequence data, targeting researchers and developers working with protein and biosynthetic gene cluster sequences. It offers pre-trained models for various tasks, including masked language modeling, inverse folding, and structure prediction, enabling efficient feature extraction and downstream analysis of biological sequences.
How It Works
The library implements several architectures, including ByteNet (CNN-based), Struct2SeqDecoder (GNN-based), and trRosetta, for processing biological sequences. It leverages masked language modeling (MLM) pre-training on large datasets like UniRef50 and antiSMASH, similar to BERT and ESM-1b. For structural modeling, it incorporates protein coordinates and multiple sequence alignments (MSAs) to predict inter-residue distances and dihedral angles.
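The masked language modeling objective mentioned above can be illustrated with a small, dependency-free sketch. It assumes the BERT-style corruption scheme (select about 15% of positions; of those, replace 80% with a mask token, 10% with a random residue, and leave 10% unchanged). Whether this repository uses exactly these rates is an assumption, and `mask_sequence`, `MASK`, and `AMINO_ACIDS` are hypothetical names chosen for illustration, not part of the library's API.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
MASK = "#"  # hypothetical mask symbol, not the library's actual token


def mask_sequence(seq, mask_prob=0.15, rng=None):
    """Corrupt a protein sequence BERT-style for MLM training.

    Returns the corrupted sequence and a dict mapping each selected
    position to its original residue (the prediction targets).
    """
    if rng is None:
        rng = random.Random()
    if not seq:
        return "", {}
    tokens = list(seq)
    # Select at least one position so short sequences still yield a target.
    n_select = max(1, round(mask_prob * len(tokens)))
    positions = rng.sample(range(len(tokens)), n_select)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]
        roll = rng.random()
        if roll < 0.8:
            tokens[pos] = MASK  # 80%: replace with the mask symbol
        elif roll < 0.9:
            tokens[pos] = rng.choice(AMINO_ACIDS)  # 10%: random residue
        # remaining 10%: leave the residue unchanged
    return "".join(tokens), targets


# Example: corrupt a short sequence with a fixed seed for reproducibility.
corrupted, targets = mask_sequence("MDREQGTRRLLPAAVK", rng=random.Random(0))
print(corrupted, targets)
```

During pre-training, a model would receive the corrupted sequence and be scored only on how well it recovers the residues stored in `targets`.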
Quick Start & Requirements
pip install sequence-models
or pip install git+https://github.com/microsoft/protein-sequence-models.git
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
Some items for ByteNet are marked as ???, indicating they may require user definition or are not fully specified in the README.
1 year ago
Inactive