protein-sequence-models by microsoft

PyTorch modules for modeling biological sequence data

Created 4 years ago

254 stars

Top 99.1% on SourcePulse

Project Summary

This repository provides PyTorch modules and utilities for modeling biological sequence data. It is aimed at researchers and developers who work with protein and biosynthetic gene cluster sequences. It offers pre-trained models for masked language modeling, inverse folding, and structure prediction, which can be used for feature extraction and downstream analysis of biological sequences.

How It Works

The library implements several architectures, including ByteNet (CNN-based), Struct2SeqDecoder (GNN-based), and trRosetta, for processing biological sequences. It leverages masked language modeling (MLM) pre-training on large datasets like UniRef50 and antiSMASH, similar to BERT and ESM-1b. For structural modeling, it incorporates protein coordinates and multiple sequence alignments (MSAs) to predict inter-residue distances and dihedral angles.
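The masked language modeling objective mentioned above can be illustrated with a minimal, library-free sketch. It assumes the BERT convention (select about 15% of positions; replace 80% of those with a mask symbol, 10% with a random residue, and leave 10% unchanged). These rates, the `#` mask symbol, and the `mask_sequence` helper are assumptions made for illustration and are not taken from the repository.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
MASK = "#"  # hypothetical mask symbol chosen for this sketch


def mask_sequence(seq, mask_prob=0.15, rng=None):
    """Corrupt a sequence for MLM training.

    Returns (corrupted, targets), where targets maps each selected
    position to the original residue the model should predict.
    """
    if rng is None:
        rng = random.Random()
    corrupted = list(seq)
    targets = {}
    for i, residue in enumerate(seq):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        targets[i] = residue
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK  # replace with the mask symbol
        elif roll < 0.9:
            corrupted[i] = rng.choice(AMINO_ACIDS)  # random residue
        # otherwise keep the original residue at this position
    return "".join(corrupted), targets


if __name__ == "__main__":
    corrupted, targets = mask_sequence("MDREQMGTRRLLP", rng=random.Random(0))
    print(corrupted, targets)
```

The training loss would then be computed only at the positions recorded in `targets`, so the model learns to recover residues from their surrounding context.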

Quick Start & Requirements

Install via pip: pip install sequence-models or pip install git+https://github.com/microsoft/protein-sequence-models.git
Requires PyTorch (tested with v1.9.0, v1.11.0, and v1.12).
Additional dependencies may include pandas, scipy, and wget.
GPU acceleration is recommended for performance.
Official documentation and pre-trained model weights are available via Zenodo.
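After installation, a quick environment check can confirm which of the packages named above are present. The sketch below uses only the standard library; the `installed_versions` helper is hypothetical, and the distribution names are assumed to match the names listed in this section.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    names = ["sequence-models", "torch", "pandas", "scipy", "wget"]
    for name, version in installed_versions(names).items():
        print(f"{name}: {version or 'not installed'}")
```

Comparing the reported PyTorch version against the tested versions listed above is a reasonable first step when an import or model load fails.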

Highlighted Details

Offers pre-trained Convolutional Autoencoding Representations of Proteins (CARP) models of various sizes (e.g., carp_640M).
Includes Masked Inverse Folding (MIF) and MIF with Sequence Transfer (MIF-ST) models for structure-aware sequence modeling.
Provides tools for bulk embedding extraction from FASTA and CSV files, with options for specific layers and output formats.
Implements ByteNet and ByteNet2d for sequence and 2D data processing, respectively, and trRosetta for MSA-based structure prediction.
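To give a sense of what bulk embedding extraction from a FASTA file involves on the input side, here is a minimal sketch that parses FASTA records and groups them into batches. It does not reproduce the repository's own extraction tooling; `read_fasta` and `batches` are hypothetical helpers, and the step that passes each batch to a pre-trained model is deliberately left out.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequences may span several lines
    if header is not None:
        yield header, "".join(chunks)


def batches(records, batch_size):
    """Group an iterable of records into lists of at most batch_size items."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


if __name__ == "__main__":
    example = [">seq1", "MDR", "EQ", ">seq2 example", "MGTRRLLP"]
    for batch in batches(read_fasta(example), batch_size=2):
        print(batch)
```

Batching keeps memory use bounded when a file holds many sequences, since only one batch needs to be held and embedded at a time.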

Maintenance & Community

Developed by Microsoft.
Links to preprints describing CARP, MIF, and MIF-ST models are provided.
Training code for BiGCARP is available separately.

Licensing & Compatibility

The repository does not explicitly state a license in the README. Users should verify licensing terms.

Limitations & Caveats

Because the README states no license, the terms for commercial use and redistribution are unclear.
Some model parameters, like the dilation factor r for ByteNet, are marked as ???, indicating they may require user definition or are not fully specified in the README.
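To show why the dilation factor matters, the following sketch computes the receptive field of a stack of dilated 1D convolutions. It assumes a schedule in which dilations double from 1 up to the largest power of two not exceeding r and then repeat. That schedule is a common ByteNet-style choice and is an assumption here, not something confirmed by the README; both helper functions are hypothetical.

```python
def dilation_schedule(n_layers, r):
    """Dilations 1, 2, 4, ... up to the largest power of two <= r, repeating.

    Assumed schedule for illustration; r must be a positive integer.
    """
    cycle = r.bit_length()  # number of powers of two that are <= r
    return [2 ** (n % cycle) for n in range(n_layers)]


def receptive_field(kernel_size, dilations):
    """Receptive field (in positions) of stacked stride-1 dilated convolutions."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


if __name__ == "__main__":
    for r in (1, 4, 128):
        dilations = dilation_schedule(n_layers=16, r=r)
        print(r, receptive_field(kernel_size=5, dilations=dilations))
```

Under these assumptions, a larger r lets the same number of layers cover far more of the sequence, which is why leaving r unspecified makes a model configuration hard to reproduce.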

Health Check

Last Commit

1 year ago

Responsiveness

Inactive

Star History

1 star in the last 30 days