ProtTrans by agemagician

Pre-trained protein language models for bioinformatics & COVID-19 research

created 5 years ago
1,234 stars

Top 32.6% on sourcepulse

Project Summary

ProtTrans provides state-of-the-art, Transformer-based language models for protein sequence analysis. It offers pre-trained models for feature extraction, fine-tuning, and prediction tasks, benefiting researchers and developers in bioinformatics and computational biology.

How It Works

ProtTrans leverages Transformer architectures (like T5 and BERT) trained on massive protein sequence datasets using high-performance computing. This self-supervised approach allows the models to learn rich representations of protein sequences, capturing complex biological patterns and relationships. The models are primarily encoder-based, with options for logits extraction.

Quick Start & Requirements

  • Install: pip install torch transformers sentencepiece
  • Prerequisites: PyTorch, Transformers library. Optional: protobuf for specific tokenizer versions. GPU recommended for performance.
  • Usage: Examples and Colab notebooks are provided for feature extraction, fine-tuning, and prediction. See Hugging Face for model availability.
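Before tokenization, the ProtTrans examples normalize raw amino-acid strings: rare or ambiguous residues (U, Z, O, B) are mapped to X, and spaces are inserted so each residue becomes one token. A minimal sketch of that preprocessing step, assuming the convention used in the repo's notebooks (function name is illustrative):

```python
import re

def preprocess_sequences(sequences):
    """Prepare raw amino-acid strings for ProtTrans tokenizers:
    map rare/ambiguous residues (U, Z, O, B) to X and insert a
    space between residues so each one becomes a separate token."""
    return [" ".join(re.sub(r"[UZOB]", "X", seq.upper())) for seq in sequences]

# Example: "O" and "B" are replaced by "X", residues are space-separated.
print(preprocess_sequences(["PRTEINO", "SEQWENCEB"]))
# → ['P R T E I N X', 'S E Q W E N C E X']
```

The resulting strings can then be passed to the Hugging Face tokenizer for the chosen ProtTrans model.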

Highlighted Details

  • Offers multiple pre-trained models (ProtT5, ProtBERT, ProtAlbert, etc.) trained on diverse datasets (UniRef50, BFD).
  • Achieves state-of-the-art performance on various downstream tasks like secondary structure prediction and subcellular localization.
  • Provides tools for feature extraction (embeddings), logits extraction, fine-tuning (including LoRA), and sequence generation.
  • Includes benchmarks and comparisons against other protein language models like ESM.
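For per-protein tasks such as subcellular localization, the per-residue embeddings produced by the encoder are typically collapsed into a single fixed-size vector by averaging over residues. A minimal mean-pooling sketch on plain Python lists (the embeddings here are toy values, not model output):

```python
def mean_pool(residue_embeddings):
    """Average a length-L list of dim-d per-residue embedding vectors
    into one per-protein vector of dimension d."""
    length = len(residue_embeddings)
    dim = len(residue_embeddings[0])
    return [sum(vec[i] for vec in residue_embeddings) / length for i in range(dim)]

# Two residues, embedding dimension 2 → one per-protein vector.
print(mean_pool([[1.0, 2.0], [3.0, 4.0]]))
# → [2.0, 3.0]
```

In practice the same pooling is done with a tensor `mean` over the sequence axis, masking out padding tokens.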

Maintenance & Community

The project is actively maintained by a team spanning several institutions, including the Technical University of Munich, Google, and NVIDIA. Community contributions are welcomed via GitHub issues and pull requests.

Licensing & Compatibility

Models are released under the Academic Free License v3.0. This license is generally permissive for academic and research use but may have restrictions for commercial applications.

Limitations & Caveats

Some advanced features like sequence generation and visualization are still under development or have limited documentation. Half-precision mode is recommended for performance but may not be suitable for all hardware configurations.

Health Check

  • Last commit: 2 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star history: 39 stars in the last 90 days
