ProstT5 by mheinzinger

Bilingual protein language model for sequence/structure translation

created 2 years ago
261 stars

Top 98.0% on sourcepulse

View on GitHub
Project Summary

ProstT5 is a bilingual language model for translating between protein sequence and structure, aimed at researchers and bioinformaticians. It converts between 1D amino-acid sequences and a 1D encoding of 3D structure (3Di tokens), supporting tasks such as embedding extraction and inverse folding.

How It Works

ProstT5 builds upon the ProtT5-XL-U50 model, which was pre-trained on billions of protein sequences using span corruption. It is then fine-tuned on 17 million proteins with high-quality 3D structure predictions from AlphaFoldDB. Each protein structure is discretized into a 1D string of 3Di tokens (one token per residue), which lets the T5 encoder-decoder architecture learn translations between the two modalities.
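The span-corruption objective mentioned above can be illustrated with a toy helper. This is a simplified, hypothetical sketch (the real T5 objective samples span positions randomly and appends a final sentinel to the target), shown here only to make the input/target format concrete:

```python
def span_corrupt(seq: str, spans: list[tuple[int, int]]) -> tuple[str, str]:
    """T5-style span corruption (simplified): each (start, end) span in the
    sequence is replaced by a sentinel token in the input; the target lists
    each sentinel followed by the residues it masked."""
    inp, tgt = [], []
    prev = 0
    for i, (s, e) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        inp.append(seq[prev:s])   # keep residues up to the span
        inp.append(sentinel)      # mask the span in the input
        tgt.append(sentinel)      # target reconstructs the masked residues
        tgt.append(seq[s:e])
        prev = e
    inp.append(seq[prev:])        # trailing residues after the last span
    return "".join(inp), "".join(tgt)

# Mask residues 2-4 and 7-8 of a short (hypothetical) sequence:
corrupted, target = span_corrupt("MKTAYIAKQR", [(2, 5), (7, 9)])
# corrupted == "MK<extra_id_0>IA<extra_id_1>R"
# target    == "<extra_id_0>TAY<extra_id_1>KQ"
```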

Quick Start & Requirements

  • Install via pip: pip install torch transformers sentencepiece protobuf (protobuf may be needed for older transformers versions).
  • Requires PyTorch and Hugging Face Transformers.
  • GPU with CUDA is recommended for performance; CPU usage is supported but significantly slower and requires full precision.
  • Example Colab notebooks and scripts are available for embedding extraction and inverse folding.
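A minimal embedding-extraction sketch, assuming the `Rostlab/ProstT5` checkpoint and the `<AA2fold>` input prefix from the model's Hugging Face card; the preprocessing helper and the example sequence are illustrative, so verify details against the repository's notebooks:

```python
import re


def preprocess_aa(seq: str) -> str:
    """Prepare an amino-acid sequence for ProstT5: map rare residues
    (U, Z, O, B) to X, insert spaces between residues, and prepend the
    <AA2fold> direction prefix that marks amino-acid input."""
    seq = re.sub(r"[UZOB]", "X", seq.upper())
    return "<AA2fold> " + " ".join(seq)


if __name__ == "__main__":
    # Downloads a multi-GB checkpoint; GPU strongly recommended.
    # Half precision is GPU-only -- keep full precision on CPU.
    import torch
    from transformers import T5Tokenizer, T5EncoderModel

    device = "cuda" if torch.cuda.is_available() else "cpu"
    tokenizer = T5Tokenizer.from_pretrained("Rostlab/ProstT5", do_lower_case=False)
    model = T5EncoderModel.from_pretrained("Rostlab/ProstT5").to(device)
    model = model.half() if device == "cuda" else model.float()

    seq = "MKTAYIAKQR"  # hypothetical example sequence
    batch = tokenizer([preprocess_aa(seq)], add_special_tokens=True,
                      padding="longest", return_tensors="pt").to(device)
    with torch.no_grad():
        emb = model(batch.input_ids,
                    attention_mask=batch.attention_mask).last_hidden_state
    # Per-residue embeddings follow the prefix token; mean-pool them
    # for a single per-protein vector.
    per_residue = emb[0, 1:len(seq) + 1]
```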

Highlighted Details

  • Translates between amino-acid sequences and 3Di structural representations.
  • Can be used for generating protein embeddings.
  • Supports inverse folding (generating sequences from structures).
  • Uses the 3Di structural alphabet introduced by Foldseek.
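Inverse folding (3Di structure string in, amino-acid sequence out) can be sketched as a seq2seq generation call. The `<fold2AA>` prefix and the lowercase-3Di requirement follow the model card; the 3Di string and sampling parameters below are illustrative assumptions:

```python
def preprocess_3di(tdi: str) -> str:
    """Foldseek emits 3Di strings in uppercase, but ProstT5's vocabulary
    expects them lowercase; also space-separate the tokens and prepend the
    <fold2AA> prefix that marks structure input."""
    return "<fold2AA> " + " ".join(tdi.lower())


if __name__ == "__main__":
    import torch
    from transformers import T5Tokenizer, AutoModelForSeq2SeqLM

    device = "cuda" if torch.cuda.is_available() else "cpu"
    tokenizer = T5Tokenizer.from_pretrained("Rostlab/ProstT5", do_lower_case=False)
    model = AutoModelForSeq2SeqLM.from_pretrained("Rostlab/ProstT5").to(device)

    tdi = "DVVSLHH"  # hypothetical 3Di string from Foldseek
    batch = tokenizer(preprocess_3di(tdi), return_tensors="pt").to(device)
    out = model.generate(batch.input_ids,
                         attention_mask=batch.attention_mask,
                         max_length=len(tdi) + 2,  # roughly one AA per 3Di token
                         do_sample=True, top_p=0.85)
    print(tokenizer.decode(out[0], skip_special_tokens=True))
```

Sampling (rather than greedy decoding) is a natural choice here because many amino-acid sequences can fold into the same backbone.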

Maintenance & Community

  • Developed by Rostlab.
  • Training scripts and data are available.
  • A Zenodo backup of the model is provided.

Licensing & Compatibility

  • Released under the MIT license, permitting commercial use and closed-source linking.

Limitations & Caveats

  • Half-precision is only supported on GPUs.
  • On older transformers versions, an UnboundLocalError may occur; installing protobuf or passing legacy=True to the tokenizer can resolve it.
  • 3Di sequences derived from Foldseek need to be converted to lowercase to avoid tokenization issues.
Health Check
Last commit

7 months ago

Responsiveness

1 week

Pull Requests (30d)
0
Issues (30d)
1
Star History
20 stars in the last 90 days

Explore Similar Projects

Starred by Jeff Hammerbacher (Cofounder of Cloudera), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 2 more.

hyena-dna by HazyResearch

0.1%
704
Genomic foundation model for long-range DNA sequence modeling
created 2 years ago
updated 3 months ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Omar Sanseviero (DevRel at Google DeepMind), and 1 more.

BioGPT by microsoft

0.1%
4k
BioGPT is a generative pre-trained transformer for biomedical text
created 3 years ago
updated 1 year ago