ProstT5 by mheinzinger

Bilingual protein language model for sequence/structure translation

Created 2 years ago
268 stars

Top 95.7% on SourcePulse

Project Summary

ProstT5 is a bilingual language model for translating between protein sequence and structure, aimed at researchers and bioinformaticians. It converts between 1D amino-acid sequences and 3D structures encoded as strings of 3Di tokens, enabling tasks such as embedding extraction and inverse folding.

How It Works

ProstT5 builds on ProtT5-XL-U50, which was pre-trained on billions of protein sequences using span corruption, and is then fine-tuned on 17 million proteins with high-quality 3D structure predictions from AlphaFoldDB. Each structure is converted to a 1D string of 3Di tokens, letting the T5 architecture learn translations between the two modalities.
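
As a sketch of the two "languages" (the <AA2fold>/<fold2AA> direction prefixes come from the ProstT5 model card; the example strings are hypothetical):

```python
# Two views of the same (hypothetical) protein: upper case marks amino
# acids, lower case marks 3Di structure tokens.
aa_seq = "MKTAYIAKQR"   # amino-acid sequence (hypothetical)
di_seq = "dpvnvlqvvd"   # 3Di token string (hypothetical)

# ProstT5 expects whitespace between residues plus a prefix token that
# selects the translation direction.
to_structure = "<AA2fold> " + " ".join(aa_seq)  # AA -> 3Di ("folding")
to_sequence = "<fold2AA> " + " ".join(di_seq)   # 3Di -> AA (inverse folding)
```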

Quick Start & Requirements

  • Install via pip: pip install torch transformers sentencepiece protobuf (protobuf may be needed for older transformers versions).
  • Requires PyTorch and Hugging Face Transformers.
  • A CUDA-capable GPU is recommended for performance; CPU inference is supported but significantly slower and requires full precision.
  • Example Colab notebooks and scripts are available for embedding extraction and inverse folding; a minimal embedding sketch follows below.
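
A minimal embedding-extraction sketch, following the conventions on the Rostlab/ProstT5 model card (the input sequence is hypothetical):

```python
import re
import torch
from transformers import T5Tokenizer, T5EncoderModel

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = T5Tokenizer.from_pretrained("Rostlab/ProstT5", do_lower_case=False)
model = T5EncoderModel.from_pretrained("Rostlab/ProstT5").to(device)
# Half precision is only supported on GPU; CPU needs full precision.
model = model.half() if device.type == "cuda" else model.float()

seq = "MKTAYIAKQR"  # hypothetical amino-acid sequence
# Map rare/ambiguous residues to X, insert whitespace between residues,
# then prepend the direction prefix (<AA2fold> for amino-acid input).
seq = "<AA2fold> " + " ".join(re.sub(r"[UZOB]", "X", seq))

batch = tokenizer.batch_encode_plus(
    [seq], add_special_tokens=True, padding="longest", return_tensors="pt"
).to(device)

with torch.no_grad():
    out = model(batch.input_ids, attention_mask=batch.attention_mask)

# Per-residue embeddings; mean-pool over the length dimension for a
# fixed-size per-protein representation.
per_protein = out.last_hidden_state.mean(dim=1)
```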

Highlighted Details

  • Translates between amino-acid sequences and 3Di structural representations.
  • Generates protein embeddings for downstream tasks.
  • Supports inverse folding (generating sequences from structures); see the sketch after this list.
  • Leverages the 3Di structural alphabet introduced by Foldseek.
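
A sketch of 3Di-to-sequence translation (inverse folding) via Hugging Face generation; the sampling parameters and length bound are illustrative rather than the tuned values from the repository, and the 3Di string is hypothetical:

```python
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = T5Tokenizer.from_pretrained("Rostlab/ProstT5", do_lower_case=False)
model = T5ForConditionalGeneration.from_pretrained("Rostlab/ProstT5").to(device)
model = model.half() if device.type == "cuda" else model.float()

di_seq = "dpvnvlqvvd"  # hypothetical 3Di string; must be lower case
inputs = tokenizer("<fold2AA> " + " ".join(di_seq), return_tensors="pt").to(device)

with torch.no_grad():
    out_ids = model.generate(
        inputs.input_ids,
        attention_mask=inputs.attention_mask,
        max_length=len(di_seq) + 5,                   # illustrative length bound
        do_sample=True, top_p=0.95, temperature=1.2,  # illustrative sampling
    )

# Decode and strip the whitespace between residues.
aa_seq = tokenizer.decode(out_ids[0], skip_special_tokens=True).replace(" ", "")
print(aa_seq)
```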

Maintenance & Community

  • Developed by Rostlab.
  • Training scripts and data are available.
  • A Zenodo backup of the model is provided.

Licensing & Compatibility

  • Released under the MIT license, permitting commercial use and closed-source linking.

Limitations & Caveats

  • Half precision is only supported on GPUs; use full precision on CPU.
  • An UnboundLocalError with older transformers versions may require installing protobuf or loading the tokenizer with legacy=True, as sketched below.
  • 3Di strings produced by Foldseek are upper case and must be converted to lower case to avoid tokenization issues.
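
A minimal sketch of the tokenizer workaround for the UnboundLocalError noted above:

```python
from transformers import T5Tokenizer

# With some older transformers versions, loading the tokenizer raises an
# UnboundLocalError; installing protobuf or passing legacy=True avoids it.
tokenizer = T5Tokenizer.from_pretrained(
    "Rostlab/ProstT5", do_lower_case=False, legacy=True
)
```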

Health Check

  • Last Commit: 8 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 3 stars in the last 30 days

Explore Similar Projects

Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Tri Dao (Chief Scientist at Together AI), and 1 more.

hnet by goombalab

  • Hierarchical sequence modeling with dynamic chunking
  • Top 1.5% on SourcePulse · 722 stars
  • Created 2 months ago · Updated 1 month ago

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Vincent Weisser (Cofounder of Prime Intellect), and 2 more.

evo by evo-design

  • DNA foundation model for long-context biological sequence modeling and design
  • Top 0.3% on SourcePulse · 1k stars
  • Created 1 year ago · Updated 1 day ago