tape by songlab-cal

Protein embedding benchmark for semi-supervised learning tasks

Created 6 years ago
727 stars

Top 47.5% on SourcePulse

View on GitHub
1 Expert Loves This Project
Project Summary

TAPE (Tasks Assessing Protein Embeddings) provides a comprehensive benchmark suite for evaluating protein language models. It offers a pretraining corpus, five downstream tasks (secondary structure prediction, contact prediction, remote homology detection, fluorescence, and stability), pretrained model weights, and benchmarking code, targeting researchers and practitioners in bioinformatics and computational biology.

How It Works

TAPE is built on PyTorch and exposes a Hugging Face-style API (from_pretrained loading with automatic weight download and caching) for pretrained protein models, including a Transformer, UniRep, and a trRosetta re-implementation. It supports both unsupervised pretraining on large corpora such as Pfam and supervised fine-tuning on specific biological tasks, enabling robust evaluation of transfer learning in protein representation learning.
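A minimal loading sketch, assuming the ProteinBertModel and TAPETokenizer names shown in the project README (verify against the installed tape_proteins version):

```python
import torch
from tape import ProteinBertModel, TAPETokenizer

# Load a pretrained Transformer (BERT-style) protein model; weights are
# downloaded and cached automatically, in the Hugging Face style.
model = ProteinBertModel.from_pretrained('bert-base')
tokenizer = TAPETokenizer(vocab='iupac')  # use vocab='unirep' for the UniRep model

# Encode one protein sequence and run a forward pass.
sequence = 'GCTVEDRCLIGMGAILLNGCVIGSGSLVAAGALITQ'
token_ids = torch.tensor([tokenizer.encode(sequence)])
with torch.no_grad():
    sequence_output, pooled_output = model(token_ids)

print(sequence_output.shape)  # (1, sequence_length, hidden_size)
```

The sequence output gives per-residue embeddings; the pooled output is a single vector per sequence.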

Quick Start & Requirements

  • Install: pip install tape_proteins
  • Prerequisites: Python, PyTorch. GPU recommended for training.
  • Data: downloadable via the provided download_data.sh script (or point the code at existing data paths). The unsupervised Pfam dataset is ~19 GB uncompressed.
  • Docs: https://github.com/songlab-cal/tape

Highlighted Details

  • Provides a Hugging Face-style API for seamless model loading and weight caching.
  • Includes a tape-embed command for generating protein embeddings from FASTA files (an in-Python sketch of this workflow follows this list).
  • Supports distributed training with features like half-precision and gradient accumulation.
  • Offers a re-implementation of the trRosetta model with provided PyTorch code and data.
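As a hedged illustration of the embedding workflow behind tape-embed (not the CLI's exact behavior), the sketch below embeds each sequence in a FASTA file and mean-pools the per-residue vectors into one embedding per protein; the input file name and the pooling choice are illustrative:

```python
import torch
from tape import ProteinBertModel, TAPETokenizer

def read_fasta(path):
    """Minimal FASTA parser yielding (record_id, sequence) pairs."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith('>'):
                if header is not None:
                    yield header, ''.join(chunks)
                header, chunks = line[1:].split()[0], []  # first token of header as ID
            elif line:
                chunks.append(line)
    if header is not None:
        yield header, ''.join(chunks)

model = ProteinBertModel.from_pretrained('bert-base').eval()
tokenizer = TAPETokenizer(vocab='iupac')

embeddings = {}
with torch.no_grad():
    for record_id, sequence in read_fasta('proteins.fasta'):  # hypothetical input file
        token_ids = torch.tensor([tokenizer.encode(sequence)])
        sequence_output, _ = model(token_ids)
        # Mean-pool over residue positions to get one fixed-size vector per protein.
        embeddings[record_id] = sequence_output.mean(dim=1).squeeze(0).numpy()
```

For large inputs, prefer the actual tape-embed command, which handles batching and writes its output to disk.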

Maintenance & Community

The project's README notes that direct training with TAPE's code is no longer recommended due to compatibility issues with newer PyTorch versions, suggesting the use of frameworks like PyTorch Lightning or Fairseq. The original TensorFlow repository is maintained separately.

Licensing & Compatibility

The repository's license is not explicitly stated in the README. However, the project's components and datasets are intended for research use, with specific citation requirements for all data sources used.

Limitations & Caveats

The project explicitly states that this PyTorch version does not aim to exactly reproduce the original paper's results; the TensorFlow version is recommended for reproducibility. Training code is not maintained against newer PyTorch versions, and issues such as multi-GPU failures or out-of-memory (OOM) errors during training will not be fixed. Documentation is also noted as incomplete.

Health Check

  • Last Commit: 3 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 3 stars in the last 30 days

Explore Similar Projects

Starred by Théophile Gervet (Cofounder of Genesis AI), Jason Knight (Director AI Compilers at NVIDIA; Cofounder of OctoML), and 7 more.

lingua by facebookresearch

0.0% · 5k stars
LLM research codebase for training and inference
Created 1 year ago · Updated 5 months ago
Starred by Jeff Hammerbacher (Cofounder of Cloudera), Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), and 25 more.

gpt-neox by EleutherAI

0.1% · 7k stars
Framework for training large-scale autoregressive language models
Created 5 years ago · Updated 1 month ago