protein-sequence-models by microsoft

PyTorch modules for modeling biological sequence data

created 4 years ago
250 stars

Project Summary

This repository provides PyTorch modules and utilities for modeling biological sequence data, targeting researchers and developers working with protein and biosynthetic gene cluster sequences. It offers pre-trained models for various tasks, including masked language modeling, inverse folding, and structure prediction, enabling efficient feature extraction and downstream analysis of biological sequences.

How It Works

The library implements several architectures, including ByteNet (CNN-based), Struct2SeqDecoder (GNN-based), and trRosetta, for processing biological sequences. It leverages masked language modeling (MLM) pre-training on large datasets like UniRef50 and antiSMASH, similar to BERT and ESM-1b. For structural modeling, it incorporates protein coordinates and multiple sequence alignments (MSAs) to predict inter-residue distances and dihedral angles.
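
The masked language modeling objective mentioned above can be illustrated with a small standard-library sketch. It follows the BERT-style corruption scheme (select about 15% of positions; of those, replace 80% with a mask symbol, 10% with a random residue, and leave 10% unchanged). The mask symbol, the function name `mask_sequence`, and the exact ratios are assumptions for illustration and may not match what the library does internally.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MASK = "#"  # hypothetical mask symbol chosen for this sketch


def mask_sequence(seq, mask_prob=0.15, rng=None):
    """Return (corrupted, targets) for BERT-style masked language modeling.

    targets[i] is the original residue at selected positions, else None.
    """
    if rng is None:
        rng = random.Random()
    corrupted, targets = [], []
    for residue in seq:
        if rng.random() < mask_prob:
            # Position is selected: the model must predict the original residue.
            targets.append(residue)
            roll = rng.random()
            if roll < 0.8:
                corrupted.append(MASK)                      # replace with mask
            elif roll < 0.9:
                corrupted.append(rng.choice(AMINO_ACIDS))   # random residue
            else:
                corrupted.append(residue)                   # keep unchanged
        else:
            targets.append(None)
            corrupted.append(residue)
    return "".join(corrupted), targets


corrupted, targets = mask_sequence("MKTAYIAKQRQISFVKSHFSRQ", rng=random.Random(0))
```

During training, the loss would be computed only at positions where `targets` is not `None`, so the model learns to recover residues from their sequence context.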

Quick Start & Requirements

  • Install via pip: pip install sequence-models or pip install git+https://github.com/microsoft/protein-sequence-models.git
  • Requires PyTorch; tested with v1.9.0, v1.11.0, and v1.12.
  • Additional dependencies may include pandas, scipy, and wget.
  • GPU acceleration is recommended for performance.
  • Official documentation and pre-trained model weights are available via Zenodo.
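
Before installing, it can help to check which of the listed dependencies are already present. The sketch below uses only the standard library; the import names (`torch`, `pandas`, `scipy`, `wget`) are assumed from the dependency list above, and `missing_packages` is a helper written for this example, not part of the library.

```python
from importlib.util import find_spec

# Import names assumed from the dependency list above.
EXPECTED = ["torch", "pandas", "scipy", "wget"]


def missing_packages(names):
    """Return the top-level module names that cannot be found on this system."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(EXPECTED)
    print("missing:", ", ".join(missing) if missing else "none")
```

Any name reported as missing can then be installed with pip before running the models.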

Highlighted Details

  • Offers pre-trained Convolutional Autoencoding Representations of Proteins (CARP) models of various sizes (e.g., carp_640M).
  • Includes Masked Inverse Folding (MIF) and MIF with Sequence Transfer (MIF-ST) models for structure-aware sequence modeling.
  • Provides tools for bulk embedding extraction from FASTA and CSV files, with options for specific layers and output formats.
  • Implements ByteNet and ByteNet2d for sequence and 2D data processing, respectively, and trRosetta for MSA-based structure prediction.
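
Bulk embedding extraction starts with reading sequences and grouping them into batches. The sketch below shows that preprocessing step with the standard library only. It is an illustrative stand-in, not the repository's extraction tool: `read_fasta` and `batched` are names invented for this example, and the model call that would consume each batch is left out.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record begins; emit the previous one if there is one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence data may be wrapped over several lines.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def batched(records, batch_size):
    """Group records into lists of at most batch_size items."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


example = """>seq1 demo
MDREQ
>seq2
MGTRR
LLP
>seq3
ACDE
""".splitlines()

batches = list(batched(read_fasta(example), batch_size=2))
```

Each batch is a list of `(header, sequence)` pairs that could be tokenized and passed to a pre-trained model, with the resulting embeddings written out per header.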

Maintenance & Community

  • Developed by Microsoft.
  • Links to preprints describing CARP, MIF, and MIF-ST models are provided.
  • Training code for BiGCARP is available separately.

Licensing & Compatibility

  • The repository does not explicitly state a license in the README. Users should verify licensing terms.

Limitations & Caveats

  • Because no license is stated in the README, the terms for commercial use and redistribution are unclear.
  • Some model parameters, like the dilation factor r for ByteNet, are marked as ???, indicating they may require user definition or are not fully specified in the README.
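
To see why the dilation setting matters, the sketch below computes the receptive field of a stack of stride-1 dilated 1D convolutions, which is 1 + Σ (k − 1)·dᵢ for kernel size k and per-layer dilations dᵢ. The schedule in `cyclic_dilations` (powers of two cycling from 1 up to a maximum) is an assumption made for illustration; the source does not specify how ByteNet's r is used, so treat both function names and the schedule as hypothetical.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field, in positions, of stacked stride-1 dilated 1D convolutions."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


def cyclic_dilations(n_layers, max_dilation):
    """Hypothetical schedule: 1, 2, 4, ..., max_dilation, then repeat.

    Assumes max_dilation is a power of two.
    """
    cycle = max_dilation.bit_length()  # count of powers of two in 1..max_dilation
    return [2 ** (i % cycle) for i in range(n_layers)]


# Three layers with kernel size 3 and dilations 1, 2, 4 cover 15 positions.
small = receptive_field(3, [1, 2, 4])

# Eight layers with kernel size 5 and a maximum dilation of 8 cover 121 positions.
large = receptive_field(5, cyclic_dilations(8, 8))
```

A larger maximum dilation widens the context each output position can see without adding layers, which is why an unspecified value leaves the effective context length of the model open.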

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 5 stars in the last 90 days
