ProstT5 by mheinzinger

Bilingual protein language model for sequence/structure translation

created 2 years ago
261 stars

Top 98.0% on sourcepulse

View on GitHub
Project Summary

ProstT5 is a bilingual language model for translating between protein sequence and structure, aimed at researchers and bioinformaticians. It converts between 1D amino-acid sequences and a 1D encoding of 3D structure (3Di tokens), supporting tasks such as embedding extraction and inverse folding.

How It Works

ProstT5 builds upon the ProtT5-XL-U50 model, which was pre-trained on billions of protein sequences using span corruption. It is then fine-tuned on 17 million proteins with high-quality 3D structure predictions from AlphaFoldDB. Each protein structure is discretized into a 1D string of 3Di tokens (one token per residue), which lets the T5 encoder-decoder architecture learn translations between the two modalities.
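The span-corruption objective mentioned above can be illustrated with a toy helper. This is a simplified, hypothetical sketch (the real T5 objective samples span positions randomly and appends a final sentinel to the target), shown here only to make the input/target format concrete:

```python
def span_corrupt(seq: str, spans: list[tuple[int, int]]) -> tuple[str, str]:
    """T5-style span corruption (simplified): each (start, end) span in the
    sequence is replaced by a sentinel token in the input; the target lists
    each sentinel followed by the residues it masked."""
    inp, tgt = [], []
    prev = 0
    for i, (s, e) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        inp.append(seq[prev:s])   # keep residues up to the span
        inp.append(sentinel)      # mask the span in the input
        tgt.append(sentinel)      # target reconstructs the masked residues
        tgt.append(seq[s:e])
        prev = e
    inp.append(seq[prev:])        # trailing residues after the last span
    return "".join(inp), "".join(tgt)

# Mask residues 2-4 and 7-8 of a short (hypothetical) sequence:
corrupted, target = span_corrupt("MKTAYIAKQR", [(2, 5), (7, 9)])
# corrupted == "MK<extra_id_0>IA<extra_id_1>R"
# target    == "<extra_id_0>TAY<extra_id_1>KQ"
```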

Quick Start & Requirements

  • Install via pip: pip install torch transformers sentencepiece protobuf (protobuf may be needed for older transformers versions).
  • Requires PyTorch and Hugging Face Transformers.
  • GPU with CUDA is recommended for performance; CPU usage is supported but significantly slower and requires full precision.
  • Example Colab notebooks and scripts are available for embedding extraction and inverse folding.
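A minimal embedding-extraction sketch, assuming the `Rostlab/ProstT5` checkpoint and the `<AA2fold>` input prefix from the model's Hugging Face card; the preprocessing helper and the example sequence are illustrative, so verify details against the repository's notebooks:

```python
import re


def preprocess_aa(seq: str) -> str:
    """Prepare an amino-acid sequence for ProstT5: map rare residues
    (U, Z, O, B) to X, insert spaces between residues, and prepend the
    <AA2fold> direction prefix that marks amino-acid input."""
    seq = re.sub(r"[UZOB]", "X", seq.upper())
    return "<AA2fold> " + " ".join(seq)


if __name__ == "__main__":
    # Downloads a multi-GB checkpoint; GPU strongly recommended.
    # Half precision is GPU-only -- keep full precision on CPU.
    import torch
    from transformers import T5Tokenizer, T5EncoderModel

    device = "cuda" if torch.cuda.is_available() else "cpu"
    tokenizer = T5Tokenizer.from_pretrained("Rostlab/ProstT5", do_lower_case=False)
    model = T5EncoderModel.from_pretrained("Rostlab/ProstT5").to(device)
    model = model.half() if device == "cuda" else model.float()

    seq = "MKTAYIAKQR"  # hypothetical example sequence
    batch = tokenizer([preprocess_aa(seq)], add_special_tokens=True,
                      padding="longest", return_tensors="pt").to(device)
    with torch.no_grad():
        emb = model(batch.input_ids,
                    attention_mask=batch.attention_mask).last_hidden_state
    # Per-residue embeddings follow the prefix token; mean-pool them
    # for a single per-protein vector.
    per_residue = emb[0, 1:len(seq) + 1]
```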

Highlighted Details

  • Translates between amino-acid sequences and 3Di structural representations.
  • Can be used for generating protein embeddings.
  • Supports inverse folding (generating sequences from structures).
  • Uses the 3Di structural alphabet introduced by Foldseek.
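Inverse folding (3Di structure string in, amino-acid sequence out) can be sketched as a seq2seq generation call. The `<fold2AA>` prefix and the lowercase-3Di requirement follow the model card; the 3Di string and sampling parameters below are illustrative assumptions:

```python
def preprocess_3di(tdi: str) -> str:
    """Foldseek emits 3Di strings in uppercase, but ProstT5's vocabulary
    expects them lowercase; also space-separate the tokens and prepend the
    <fold2AA> prefix that marks structure input."""
    return "<fold2AA> " + " ".join(tdi.lower())


if __name__ == "__main__":
    import torch
    from transformers import T5Tokenizer, AutoModelForSeq2SeqLM

    device = "cuda" if torch.cuda.is_available() else "cpu"
    tokenizer = T5Tokenizer.from_pretrained("Rostlab/ProstT5", do_lower_case=False)
    model = AutoModelForSeq2SeqLM.from_pretrained("Rostlab/ProstT5").to(device)

    tdi = "DVVSLHH"  # hypothetical 3Di string from Foldseek
    batch = tokenizer(preprocess_3di(tdi), return_tensors="pt").to(device)
    out = model.generate(batch.input_ids,
                         attention_mask=batch.attention_mask,
                         max_length=len(tdi) + 2,  # roughly one AA per 3Di token
                         do_sample=True, top_p=0.85)
    print(tokenizer.decode(out[0], skip_special_tokens=True))
```

Sampling (rather than greedy decoding) is a natural choice here because many amino-acid sequences can fold into the same backbone.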

Maintenance & Community

  • Developed by Rostlab.
  • Training scripts and data are available.
  • A Zenodo backup of the model is provided.

Licensing & Compatibility

  • Released under the MIT license, permitting commercial use and closed-source linking.

Limitations & Caveats

  • Half-precision is only supported on GPUs.
  • On older transformers versions, an UnboundLocalError may occur; installing protobuf or passing legacy=True to the tokenizer can resolve it.
  • 3Di sequences derived from Foldseek need to be converted to lowercase to avoid tokenization issues.
Health Check
Last commit

7 months ago

Responsiveness

1 week

Pull Requests (30d)
0
Issues (30d)
1
Star History
20 stars in the last 90 days

Explore Similar Projects

Starred by Jeff Hammerbacher (Cofounder of Cloudera), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 2 more.

hyena-dna by HazyResearch

0.1%
704
Genomic foundation model for long-range DNA sequence modeling
created 2 years ago
updated 3 months ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Omar Sanseviero (DevRel at Google DeepMind), and 1 more.

BioGPT by microsoft

0.1%
4k
BioGPT is a generative pre-trained transformer for biomedical text
created 3 years ago
updated 1 year ago