ProtTrans by agemagician

Pre-trained protein language models for bioinformatics & COVID-19 research

created 5 years ago
1,234 stars

Top 32.6% on sourcepulse

Project Summary

ProtTrans provides state-of-the-art, Transformer-based language models for protein sequence analysis. It offers pre-trained models for feature extraction, fine-tuning, and prediction tasks, benefiting researchers and developers in bioinformatics and computational biology.

How It Works

ProtTrans leverages Transformer architectures (like T5 and BERT) trained on massive protein sequence datasets using high-performance computing. This self-supervised approach allows the models to learn rich representations of protein sequences, capturing complex biological patterns and relationships. The models are primarily encoder-based, with options for logits extraction.

Quick Start & Requirements

  • Install: pip install torch transformers sentencepiece
  • Prerequisites: PyTorch, Transformers library. Optional: protobuf for specific tokenizer versions. GPU recommended for performance.
  • Usage: Examples and Colab notebooks are provided for feature extraction, fine-tuning, and prediction. See Hugging Face for model availability.
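Before tokenization, the ProtTrans examples normalize raw amino-acid strings: rare or ambiguous residues (U, Z, O, B) are mapped to X, and spaces are inserted so each residue becomes one token. A minimal sketch of that preprocessing step, assuming the convention used in the repo's notebooks (function name is illustrative):

```python
import re

def preprocess_sequences(sequences):
    """Prepare raw amino-acid strings for ProtTrans tokenizers:
    map rare/ambiguous residues (U, Z, O, B) to X and insert a
    space between residues so each one becomes a separate token."""
    return [" ".join(re.sub(r"[UZOB]", "X", seq.upper())) for seq in sequences]

# Example: "O" and "B" are replaced by "X", residues are space-separated.
print(preprocess_sequences(["PRTEINO", "SEQWENCEB"]))
# → ['P R T E I N X', 'S E Q W E N C E X']
```

The resulting strings can then be passed to the Hugging Face tokenizer for the chosen ProtTrans model.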

Highlighted Details

  • Offers multiple pre-trained models (ProtT5, ProtBERT, ProtAlbert, etc.) trained on diverse datasets (UniRef50, BFD).
  • Achieves state-of-the-art performance on various downstream tasks like secondary structure prediction and subcellular localization.
  • Provides tools for feature extraction (embeddings), logits extraction, fine-tuning (including LoRA), and sequence generation.
  • Includes benchmarks and comparisons against other protein language models like ESM.
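For per-protein tasks such as subcellular localization, the per-residue embeddings produced by the encoder are typically collapsed into a single fixed-size vector by averaging over residues. A minimal mean-pooling sketch on plain Python lists (the embeddings here are toy values, not model output):

```python
def mean_pool(residue_embeddings):
    """Average a length-L list of dim-d per-residue embedding vectors
    into one per-protein vector of dimension d."""
    length = len(residue_embeddings)
    dim = len(residue_embeddings[0])
    return [sum(vec[i] for vec in residue_embeddings) / length for i in range(dim)]

# Two residues, embedding dimension 2 → one per-protein vector.
print(mean_pool([[1.0, 2.0], [3.0, 4.0]]))
# → [2.0, 3.0]
```

In practice the same pooling is done with a tensor `mean` over the sequence axis, masking out padding tokens.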

Maintenance & Community

The project is actively maintained by a team spanning several institutions, including the Technical University of Munich, Google, and NVIDIA. Community contributions are welcomed via GitHub issues and pull requests.

Licensing & Compatibility

Models are released under the Academic Free License v3.0. This license is generally permissive for academic and research use but may have restrictions for commercial applications.

Limitations & Caveats

Some advanced features like sequence generation and visualization are still under development or have limited documentation. Half-precision mode is recommended for performance but may not be suitable for all hardware configurations.

Health Check

  • Last commit: 2 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star history: 39 stars in the last 90 days
