Bilingual protein language model for sequence/structure translation
ProstT5 is a bilingual language model designed for protein sequence and structure translation, targeting researchers and bioinformaticians. It enables conversion between 1D protein sequences and 3D structural representations (3Di-tokens), facilitating tasks like embedding extraction and inverse folding.
How It Works
ProstT5 builds upon the ProtT5-XL-U50 model, which was pre-trained on billions of protein sequences using span corruption. It is then fine-tuned on 17 million proteins with high-quality 3D structure predictions from AlphaFoldDB. Protein structures are converted to a 1D representation using 3Di-tokens, enabling the T5 architecture to learn translations between these two modalities.
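As a rough illustration of how both modalities can share one T5 vocabulary, the sketch below prepares inputs for the two translation directions. It relies on conventions from the upstream ProstT5 model card, not on anything stated in this summary, so treat them as assumptions: amino acids are written in upper case, 3Di-tokens in lower case, residues are separated by spaces, rare amino acids (U, Z, O, B) are mapped to X, and the direction is signalled by a `<AA2fold>` or `<fold2AA>` prefix. The helper name `prepare_input` is hypothetical.

```python
import re

# Assumed direction prefixes (from the upstream model card, not from this summary).
AA_TO_3DI = "<AA2fold>"
THREEDI_TO_AA = "<fold2AA>"


def prepare_input(sequence: str) -> str:
    """Format one sequence for a T5-style tokenizer (hypothetical helper).

    Upper-case input is treated as amino acids, anything else as 3Di-tokens.
    """
    if sequence.isupper():
        # Map rare or ambiguous amino acids to the unknown residue X.
        sequence = re.sub(r"[UZOB]", "X", sequence)
        prefix = AA_TO_3DI
    else:
        prefix = THREEDI_TO_AA
    # Separate residues with spaces so each one becomes its own token.
    return prefix + " " + " ".join(sequence)


if __name__ == "__main__":
    print(prepare_input("PRTEINO"))  # amino acids, to be translated to 3Di
    print(prepare_input("strct"))    # 3Di-tokens, to be translated to amino acids
```

The case-based routing is only a convenience for this sketch; a real pipeline would more likely track the modality of each input explicitly.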
Quick Start & Requirements
pip install torch transformers sentencepiece protobuf
(protobuf may be needed for older transformers versions.)
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
An UnboundLocalError raised with older transformers versions might require installing protobuf or setting legacy=True.
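A minimal sketch of where the `legacy=True` workaround could go, assuming it refers to the `legacy` argument of the Hugging Face `T5Tokenizer` and that the checkpoint is published under the id `Rostlab/ProstT5` (neither is confirmed by this summary). The helper names `tokenizer_kwargs` and `load_tokenizer` are hypothetical.

```python
def tokenizer_kwargs(use_legacy: bool = True) -> dict:
    """Build keyword arguments for the tokenizer (hypothetical helper)."""
    # Keep case intact: upper and lower case are assumed to carry meaning.
    kwargs = {"do_lower_case": False}
    if use_legacy:
        # Assumed workaround for the UnboundLocalError when protobuf is absent.
        kwargs["legacy"] = True
    return kwargs


def load_tokenizer(model_name: str = "Rostlab/ProstT5", use_legacy: bool = True):
    """Load the tokenizer; the model id is an assumption, not taken from this summary."""
    # Imported lazily so defining this helper does not require transformers.
    from transformers import T5Tokenizer

    return T5Tokenizer.from_pretrained(model_name, **tokenizer_kwargs(use_legacy))
```

Calling `load_tokenizer()` downloads the tokenizer files on first use. If the error persists, installing protobuf is the other option named above.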