Protein embedding benchmark for semi-supervised learning tasks
TAPE (Tasks Assessing Protein Embeddings) provides a comprehensive benchmark suite for evaluating protein language models. It offers a pretraining corpus, five downstream tasks (secondary structure prediction, contact prediction, remote homology detection, fluorescence, and stability), pretrained model weights, and benchmarking code, targeting researchers and practitioners in bioinformatics and computational biology.
How It Works
TAPE is built on PyTorch and uses the Hugging Face API to define and load pretrained protein models, including Transformer, UniRep, and trRosetta. It supports both unsupervised pretraining on large datasets such as Pfam and supervised fine-tuning on the downstream tasks, so a single pretrained representation can be evaluated on how well it transfers to each task.
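As a minimal sketch of the loading interface, the snippet below follows the usage pattern shown in the TAPE README: `ProteinBertModel.from_pretrained('bert-base')` paired with a `TAPETokenizer` using the `iupac` vocabulary. The wrapper function `embed_sequence` and its parameter names are illustrative and not part of TAPE. Calling it assumes `tape_proteins` and `torch` are installed and that the pretrained weights can be downloaded.

```python
def embed_sequence(sequence, model_name="bert-base", vocab="iupac"):
    """Return (per-residue, pooled) embeddings for one amino-acid sequence.

    Illustrative wrapper, not a TAPE function. Imports are deferred so that
    defining the function does not require tape_proteins or torch.
    """
    import torch
    from tape import ProteinBertModel, TAPETokenizer

    model = ProteinBertModel.from_pretrained(model_name)  # fetches weights on first use
    model.eval()  # disable dropout for inference
    tokenizer = TAPETokenizer(vocab=vocab)  # README: 'iupac' for TAPE models, 'unirep' for UniRep

    # Encode the sequence to token ids and add a batch dimension of size 1.
    token_ids = torch.tensor([tokenizer.encode(sequence)])
    with torch.no_grad():
        output = model(token_ids)
    # output[0] is the per-token sequence output, output[1] the pooled output.
    return output[0], output[1]
```

A call such as `embed_sequence("GCTVEDRCLIGMGAILLNGCVIGSGSLVAAGALITQ")` would then return the two tensors for that sequence, provided the environment assumptions above hold.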
Quick Start & Requirements
pip install tape_proteins
download_data.sh
or specified paths. Unsupervised Pfam dataset is ~19GB uncompressed.
Highlighted Details
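The `tape-embed` command reads its input from a FASTA file. A minimal sketch of preparing one with the Python standard library is shown here. The helper names `write_fasta` and `read_fasta`, the record identifiers, and the file name are hypothetical, and the command line in the closing comment follows the form of the README's example, so it should be checked against the installed version.

```python
import os
import tempfile


def write_fasta(records, path):
    """Write (identifier, sequence) pairs as FASTA: a '>' header line, then the sequence."""
    with open(path, "w") as handle:
        for identifier, sequence in records:
            handle.write(f">{identifier}\n{sequence}\n")


def read_fasta(path):
    """Parse a FASTA file into a dict mapping identifier -> sequence."""
    sequences = {}
    current = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                current = line[1:].split()[0]  # identifier is the first token after '>'
                sequences[current] = ""
            elif current is not None:
                sequences[current] += line  # a sequence may wrap over several lines
    return sequences


# Hypothetical records, used only to illustrate the file layout.
records = [
    ("protein_a", "GCTVEDRCLIGMGAILLNGCVIGSGSLVAAGALITQ"),
    ("protein_b", "MKTAYIAKQR"),
]
fasta_path = os.path.join(tempfile.mkdtemp(), "my_input.fasta")
write_fasta(records, fasta_path)
parsed = read_fasta(fasta_path)

# The file could then be embedded with a command of the form (per the README example):
#   tape-embed unirep my_input.fasta output_filename.npz babbler-1900 --tokenizer unirep
```

Reading the file back with `read_fasta` is a quick check that the identifiers and sequences round-trip before a longer embedding run is started.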
tape-embed command for generating protein embeddings from FASTA files.
Maintenance & Community
The project's README notes that direct training with TAPE's code is no longer recommended due to compatibility issues with newer PyTorch versions, suggesting the use of frameworks like PyTorch Lightning or Fairseq. The original TensorFlow repository is maintained separately.
Licensing & Compatibility
The repository's license is not explicitly stated in the README. However, the project's components and datasets are intended for research use, with specific citation requirements for all data sources used.
Limitations & Caveats
The project states that this PyTorch version is not intended to reproduce the original paper's results exactly and recommends the TensorFlow version for reproducibility. The training code is not maintained for newer PyTorch versions, and multi-GPU and out-of-memory (OOM) errors during training will not be fixed. The documentation is also noted as incomplete.
2 years ago
Inactive